2015 | Book

Data Science, Learning by Latent Structures, and Knowledge Discovery

Edited by: Berthold Lausen, Sabine Krolak-Schwerdt, Matthias Böhmer

Publisher: Springer Berlin Heidelberg

Book series: Studies in Classification, Data Analysis, and Knowledge Organization

About this book

This volume comprises papers dedicated to data science and the extraction of knowledge from many types of data: structural, quantitative, or statistical approaches for the analysis of data; advances in classification, clustering and pattern recognition methods; strategies for modeling complex data and mining large data sets; applications of advanced methods in specific domains of practice. The contributions offer interesting applications to various disciplines such as psychology, biology, medical and health sciences; economics, marketing, banking and finance; engineering; geography and geology; archeology, sociology, educational sciences, linguistics and musicology; library science. The book contains the selected and peer-reviewed papers presented during the European Conference on Data Analysis (ECDA 2013) which was jointly held by the German Classification Society (GfKl) and the French-speaking Classification Society (SFC) in July 2013 at the University of Luxembourg.

Table of contents

Frontmatter

Invited Papers

Frontmatter
Modernising Official Statistics: A Complex Challenge

In Europe, national statistical organisations and Eurostat, the statistical office of the European Union, produce and disseminate official statistics. These organisations come together as partners in the European Statistical System (ESS). This paper describes the ESS, the challenges it faces and the modernisation efforts that have been undertaken based on a redesigned ESS enterprise architecture. It also outlines the probable future direction of the ESS.

August Götzfried
A New Supervised Classification of Credit Approval Data via the Hybridized RBF Neural Network Model Using Information Complexity

In this paper, we introduce a new approach for supervised classification to handle mixed (i.e., categorical, binary, and continuous) data structures using hybrid radial basis function neural networks (HRBF-NN). HRBF-NN supervised classification combines regression trees, ridge regression, and the genetic algorithm (GA) with radial basis function (RBF) neural networks (NN), along with the information complexity (ICOMP) criterion as the fitness function, to carry out both classification and subset selection of the best predictors that discriminate between the classes. In this manner, we reduce the dimensionality of the data and at the same time improve the classification accuracy of the fitted predictive model. We apply HRBF-NN supervised classification to a real benchmark credit approval mixed data set to classify customers into good/bad classes for credit approval. Our results show the excellent performance of the HRBF-NN method in supervised classification tasks.

Oguz Akbilgic, Hamparsum Bozdogan
Finding the Number of Disparate Clusters with Background Contamination

The Forward Search is used in an exploratory manner, with many random starts, to indicate the number of clusters and their membership in continuous data. The prospective clusters can readily be distinguished from background noise and from other forms of outliers. A confirmatory Forward Search, involving control on the sizes of statistical tests, establishes precise cluster membership. The method performs as well as robust methods such as TCLUST. However, it does not require prior specification of the number of clusters, nor of the level of trimming of outliers. In this way it is “user friendly”.

Anthony C. Atkinson, Andrea Cerioli, Gianluca Morelli, Marco Riani
Clustering of Solar Irradiance

The development of grid-connected photovoltaic power systems leads to new challenges. Short- or medium-term prediction of the solar irradiance is definitely a solution to reduce the required storage capacities and, as a result, allows increasing the penetration of photovoltaic units on the power grid. We present the first results of an interdisciplinary research project which involves researchers in energy, meteorology, and data mining, addressing this real-world problem. In Reunion Island, from December 2008 to March 2012, solar radiation measurements were collected every minute using calibrated instruments. Prior to prediction modelling, two clustering strategies have been applied to analyse the database of 951 days. The first approach combines the following proven data-mining methods: principal component analysis (PCA) as a pre-process for reduction and denoising, and the Ward hierarchical and K-means methods to find a partition with a good number of classes. The second approach uses a clustering method that operates on a set of dissimilarity matrices. Each cluster is represented by an element or a subset of the set of objects to be classified. The five meaningful clusters found by the two clustering approaches are compared. The merits and disadvantages of the two approaches for classifying curves are discussed.
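
As a rough illustration of the first pipeline described in this abstract, the sketch below chains PCA-based reduction, Ward hierarchical clustering and k-means. It is a minimal, hypothetical example assuming scikit-learn and random stand-in data rather than the actual 951 daily irradiance curves; it is not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, KMeans

# Hypothetical stand-in for the 951 daily irradiance curves (one row per day,
# one column per minute-of-day measurement); the real data are not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(951, 1440))

# Step 1: PCA as a pre-process for reduction and denoising.
X_red = PCA(n_components=10).fit_transform(X)

# Step 2: Ward hierarchical clustering to propose a partition into K classes.
K = 5  # the abstract reports five meaningful clusters
ward_labels = AgglomerativeClustering(n_clusters=K, linkage="ward").fit_predict(X_red)

# Step 3: k-means initialised from the Ward centroids to consolidate the partition.
centroids = np.vstack([X_red[ward_labels == k].mean(axis=0) for k in range(K)])
kmeans = KMeans(n_clusters=K, init=centroids, n_init=1).fit(X_red)
print(np.bincount(kmeans.labels_))
```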

Miloud Bessafi, Francisco de A.T. de Carvalho, Philippe Charton, Mathieu Delsaut, Thierry Despeyroux, Patrick Jeanty, Jean Daniel Lan-Sun-Luk, Yves Lechevallier, Henri Ralambondrainy, Lionel Trovalet

Data Science and Clustering

Frontmatter
Factor Analysis of Local Formalism

Local formalism deals with weighted unoriented networks, specified by an exchange matrix determining the selection probabilities of pairs of vertices. It allows local inertia and local autocorrelation to be defined relative to arbitrary networks. In particular, free partitioned exchanges amount to defining a categorical variable (hard membership), together with canonical spectral scores identical to Fisher's discriminant functions. We demonstrate how to extend the construction of the latter to any unoriented network, and how to assess the similarity between canonical and original configurations, as illustrated on four datasets.

François Bavaud, Christelle Cocco
Recent Progress in Complex Network Analysis: Models of Random Intersection Graphs

Experimental results show that in large complex networks such as the Internet or biological networks, there is a tendency to connect elements which have a common neighbor. In theoretical random graph models this tendency is captured by an asymptotically constant clustering coefficient. Moreover, complex networks have power-law degree distributions and small diameter (small-world phenomena), so these are desirable features of random graphs used for modeling real-life networks. We survey various variants of random intersection graph models, which are important for network modeling.

Mindaugas Bloznelis, Erhard Godehardt, Jerzy Jaworski, Valentas Kurauskas, Katarzyna Rybarczyk
Recent Progress in Complex Network Analysis: Properties of Random Intersection Graphs

Experimental results show that in large complex networks (such as the Internet, social or biological networks) there exists a tendency to connect elements which have a common neighbor. In theoretical random graph models, this tendency is described by the clustering coefficient being bounded away from zero. Complex networks also have power-law degree distributions and short average distances (small world phenomena). These are desirable features of random graphs used for modeling real life networks. We survey recent results concerning various random intersection graph models showing that they have tunable clustering coefficient, a rich class of degree distributions including power-laws, and short average distances.

Mindaugas Bloznelis, Erhard Godehardt, Jerzy Jaworski, Valentas Kurauskas, Katarzyna Rybarczyk
Similarity Measures of Concept Lattices

Concept lattices fulfil one of the aims of classification by providing a description by attributes of each class of objects. We introduce here two new similarity/dissimilarity measures: a similarity measure between concepts (elements) of a lattice and a dissimilarity measure between concept lattices defined on the same set of objects and attributes. Both measures are based on the overhanging relations previously introduced by the author, which are a cryptomorphism of lattices.

Florent Domenach
Flow-Based Dissimilarities: Shortest Path, Commute Time, Max-Flow and Free Energy

Random-walk based dissimilarities on weighted networks have demonstrated their efficiency in clustering algorithms. This contribution considers a few alternative network dissimilarities, among which a new max-flow dissimilarity, and more general flow-based dissimilarities, freely mixing shortest paths and random walks as a function of a free parameter, the temperature. Their geometrical properties, and in particular their squared Euclidean nature, are investigated through their power indices and multidimensional scaling properties. In particular, formal and numerical studies demonstrate the existence of critical temperatures, where flow-based dissimilarities cease to be squared Euclidean. The clustering potential of medium-range temperatures is emphasized.
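
For readers unfamiliar with this family, the sketch below computes one classical random-walk based dissimilarity that such work builds on: the commute-time distance obtained from the pseudo-inverse of the graph Laplacian. The general flow-based dissimilarities with their temperature parameter are not reproduced here; this is a minimal NumPy sketch on a toy graph.

```python
import numpy as np

def commute_time_dissimilarity(W):
    """Commute-time dissimilarities on a weighted undirected network.

    W is a symmetric non-negative adjacency (exchange) matrix. The commute time
    between i and j is vol(G) * (L+_ii + L+_jj - 2 L+_ij), where L+ is the
    Moore-Penrose pseudo-inverse of the graph Laplacian L = D - W.
    """
    d = W.sum(axis=1)
    L = np.diag(d) - W
    Lp = np.linalg.pinv(L)
    vol = d.sum()
    diag = np.diag(Lp)
    return vol * (diag[:, None] + diag[None, :] - 2 * Lp)

# Toy 4-node weighted graph (illustrative only).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(commute_time_dissimilarity(W).round(2))
```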

Guillaume Guex, François Bavaud
Resampling Techniques in Cluster Analysis: Is Subsampling Better Than Bootstrapping?

In the case of two small toy data sets, we found that subsampling shows much weaker behaviour than bootstrapping in finding the true number of clusters K (Mucha and Bartel, Soft bootstrapping in cluster analysis and its comparison with other resampling methods. In: M. Spiliopoulou, L. Schmidt-Thieme, R. Janning (eds.) Data analysis, machine learning and knowledge discovery. Springer, Cham, 2014). In contrast, Möller and Dörte (Intell Data Anal 10:139–162, 2006) pointed out that "subsampling clearly outperformed the bootstrapping technique in the detection of correct clustering consensus results." Obviously, there is a need for further investigation. Therefore, we compare these two resampling techniques here on real and artificial data sets by means of different indices such as the adjusted Rand index (ARI) or the Jaccard index. We consider hierarchical cluster analysis methods because they find all partitions into K = 2, 3, … clusters in a single run and, moreover, these results are (usually) unique (Spaeth, Cluster analysis algorithms for data reduction and classification of objects. Ellis Horwood, Chichester, 1982). The methods are tested on two synthetic data sets and two real data sets. Obviously, bootstrapping is better than subsampling in finding the true number of clusters.
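
A minimal sketch of the kind of resampling comparison discussed here, assuming SciPy/scikit-learn and a toy data set (not the authors' exact protocol): the stability of a Ward partition under bootstrapping versus subsampling is scored with the adjusted Rand index for several candidate numbers of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

def stability(X, K, resample, n_rep=100, seed=0):
    """Average ARI between the reference Ward partition and partitions of resamples."""
    rng = np.random.default_rng(seed)
    ref = fcluster(linkage(X, method="ward"), K, criterion="maxclust")
    scores = []
    for _ in range(n_rep):
        if resample == "bootstrap":
            idx = rng.integers(0, len(X), len(X))                 # draw with replacement
        else:                                                      # subsampling
            idx = rng.choice(len(X), int(0.7 * len(X)), replace=False)
        lab = fcluster(linkage(X[idx], method="ward"), K, criterion="maxclust")
        scores.append(adjusted_rand_score(ref[idx], lab))
    return np.mean(scores)

# Toy data with three well-separated groups (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0, 3, 6)])
for K in (2, 3, 4, 5):
    print(K, round(stability(X, K, "bootstrap"), 2), round(stability(X, K, "subsampling"), 2))
```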

Hans-Joachim Mucha, Hans-Georg Bartel
On-Line Clustering of Functional Boxplots for Monitoring Multiple Streaming Time Series

In this paper we introduce a micro-clustering strategy for functional boxplots. The aim is to summarize a set of streaming time series split into non-overlapping windows. It is a two-step strategy which first performs an on-line summarization by means of functional data structures, named Functional Boxplot micro-clusters, and then reveals the final summarization by processing the functional data structures off-line. Our main contribution consists in providing a new definition of micro-cluster based on Functional Boxplots and in defining a proximity measure which allows comparing and updating them. This yields a finer graphical summarization of the streaming time series through five functional basic statistics of the data. The obtained synthesis is able to keep track of the dynamic evolution of the multiple streams.

Elvira Romano, Antonio Balzanella
Smooth Tests of Fit for Gaussian Mixtures

Model based clustering and classification are often based on a finite mixture distribution. The most popular choice for the mixture component distribution is the Gaussian distribution (Fraley and Raftery, J Stat Softw 18(6):1–13, 2007). Many tests, for example those based on goodness of fit measures, focus on detecting the order of the mixture. However, diagnostic tests to confirm the distributional assumptions are often neglected, which may lead to invalid conclusions from the cluster analysis.

Smooth tests (Rayner et al., Smooth tests of goodness of fit: using R, 2nd edn. Wiley, Singapore, 2009) can be used to test the distributional assumptions against the so-called general smooth alternatives in the sense of Neyman (Skandinavisk Aktuarietidskr 20:150–99, 1937). To test for a mixture distribution we present smooth tests that have the additional advantage that they permit the testing of sub-hypotheses using components. These test statistics are asymptotically chi-squared distributed. Results of the simulation study show that bootstrapping needs to be applied for small to medium sample sizes to maintain the P(type I error) at the nominal level and that the proposed tests have high power against various alternatives. Lastly the tests are illustrated on a data set on the average amount of precipitation in inches for each of 70 United States and Puerto Rico cities (Mcneil, Interactive data analysis. Wiley, New York, 1977).
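
The role of the bootstrap for calibrating such tests can be illustrated with a generic parametric bootstrap for a goodness-of-fit statistic of a fitted two-component Gaussian mixture. The sketch below assumes scikit-learn/SciPy and uses a Kolmogorov-Smirnov type statistic as a stand-in for the smooth test statistic; it is illustrative only and not the authors' test.

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def ks_to_mixture(x, gm):
    """KS distance between the empirical CDF of x and the fitted mixture CDF."""
    w = gm.weights_
    mu = gm.means_.ravel()
    sd = np.sqrt(gm.covariances_).ravel()
    cdf = lambda t: np.sum(w * stats.norm.cdf((t[:, None] - mu) / sd), axis=1)
    return stats.kstest(x, cdf).statistic

def bootstrap_pvalue(x, n_components=2, B=200, seed=0):
    """Parametric bootstrap: resample from the fitted mixture, refit, recompute the statistic."""
    gm = GaussianMixture(n_components, random_state=0).fit(x[:, None])
    t_obs = ks_to_mixture(x, gm)
    t_boot = []
    for _ in range(B):
        xb = gm.sample(len(x))[0].ravel()
        gmb = GaussianMixture(n_components, random_state=0).fit(xb[:, None])
        t_boot.append(ks_to_mixture(xb, gmb))
    return float(np.mean(np.array(t_boot) >= t_obs))

# Synthetic two-component sample (illustrative only).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 50)])
print(bootstrap_pvalue(x))
```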

Thomas Suesse, John Rayner, Olivier Thas

Machine Learning and Knowledge Discovery

Frontmatter
P2P RVM for Distributed Classification

In recent years there has been increasing interest in analytical methods that learn patterns over large-scale data distributed over Peer-to-Peer (P2P) networks and support applications. Mining patterns in such a distributed and dynamic environment is a challenging task, because centralization of the data is not feasible. In this paper, we propose a distributed classification technique based on relevance vector machines (RVM) and local model exchange among neighboring peers in a P2P network. In such networks, the evaluation criteria for an efficient distributed classification algorithm are the size of the resulting local models (communication efficiency) and their prediction accuracy. RVM utilizes dramatically fewer kernel functions than a state-of-the-art support vector machine (SVM), while demonstrating comparable generalization performance. This makes RVM a suitable choice for learning compact and accurate local models at each peer in a P2P network. Our model propagation approach exchanges the resulting models with peers in a local neighborhood to produce a more accurate network-wide global model, while keeping the communication cost low throughout the network. Through extensive experimental evaluations, we demonstrate that by using more relevant and compact models, our approach outperforms the baseline model propagation approaches in terms of accuracy and communication cost.

Muhammad Umer Khan, Alexandros Nanopoulos, Lars Schmidt-Thieme
Selecting a Multi-Label Classification Method for an Interactive System

Interactive classification-based systems engage users to coach learning algorithms to take into account their own individual preferences. However, most of the recent interactive systems limit the users to single-label classification, which may not be expressive enough for some organization tasks such as film classification, where a multi-label scheme is required. The objective of this paper is to compare the behaviors of 12 multi-label classification methods in an interactive framework where "good" predictions must be produced in a very short time from a very small set of multi-label training examples. The experiments highlight important performance differences for four complementary evaluation measures (Log-Loss, Ranking-Loss, Learning and Prediction Times). The best results are obtained with Multi-label k-Nearest Neighbors (ML-kNN), ensembles of classifier chains (ECC), and ensembles of binary relevance (EBR).

Noureddine-Yassine NAIR-BENREKIA, Pascale Kuntz, Frank Meyer
Visual Analysis of Topics in Twitter Based on Co-evolution of Terms

The analysis of Twitter short messages has become a key issue for companies seeking to understand consumer behaviour and expectations. However, automatic algorithms for topic tracking often extract general tendencies at a high granularity level and do not provide added value to experts who are looking for more subtle information. In this paper, we focus on the visualization of the co-evolution of terms in tweets in order to facilitate the analysis of the evolution of topics by a decision-maker. We take advantage of the perceptual quality of heatmaps to display our 3D data (term × time × score) in a 2D space. Furthermore, by computing an appropriate order to display the main terms on the heatmap, our methodology ensures an intuitive visualization of their co-evolution. An experiment was conducted on real-life datasets in collaboration with an expert in customer relationship management working at the French energy company EDF. The first results show three different kinds of co-evolution of terms: bursty features, reoccurring terms and long periods of activity.
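
A minimal sketch of the visualization idea, assuming matplotlib/SciPy and a hypothetical term-by-time score matrix (the real EDF tweet data are of course not reproduced): terms are reordered so that similar temporal profiles are adjacent before being drawn as a heatmap, which is what makes co-evolution visually apparent.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

# Hypothetical term-by-time score matrix (rows: terms, columns: time windows).
rng = np.random.default_rng(0)
terms = [f"term_{i}" for i in range(20)]
scores = rng.random((20, 52))

# Order the terms so that terms with similar temporal profiles end up adjacent.
order = leaves_list(linkage(scores, method="average"))

plt.imshow(scores[order], aspect="auto", cmap="viridis")
plt.yticks(range(len(terms)), [terms[i] for i in order])
plt.xlabel("time window")
plt.colorbar(label="score")
plt.tight_layout()
plt.show()
```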

Lambert Pépin, Julien Blanchard, Fabrice Guillet, Pascale Kuntz, Philippe Suignard
Incremental Weighted Naive Bayes Classifiers for Data Stream

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with a naive independence assumption: the explanatory variables (Xi) are assumed to be conditionally independent given the target variable (Y). Despite this strong assumption, this classifier has proved to be very effective on many real applications and is often used on data streams for supervised classification. The naive Bayes classifier simply relies on the estimation of the univariate conditional probabilities P(Xi | C). This estimation can be provided on a data stream using a "supervised quantiles summary." The literature shows that the naive Bayes classifier can be improved by (1) using a variable selection method or (2) weighting the explanatory variables. Most of these methods are related to batch (off-line) learning, need to store all the data in memory and/or require reading each example more than once; therefore they cannot be used on a data stream. This paper presents a new method, based on a graphical model, which computes the weights on the input variables using a stochastic estimation. The method is incremental and produces a weighted naive Bayes classifier for data streams. This method is compared to the classical naive Bayes classifier on the Large Scale Learning challenge datasets.
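
A minimal sketch of an incremental naive Bayes classifier with per-variable weights, for illustration only: the counts are updated one example at a time (single pass, constant memory), and the weights are simply supplied rather than estimated by the stochastic procedure described in the abstract.

```python
import numpy as np
from collections import defaultdict

class IncrementalWeightedNB:
    """Minimal incremental naive Bayes over pre-discretised inputs.

    Each explanatory variable i carries a weight w_i applied to its
    log-conditional-likelihood in the decision rule.
    """

    def __init__(self, n_vars, weights=None):
        self.n_vars = n_vars
        self.w = np.ones(n_vars) if weights is None else np.asarray(weights, float)
        self.class_counts = defaultdict(int)
        self.cond_counts = defaultdict(int)   # key: (variable, value, class)

    def partial_fit(self, x, y):
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.cond_counts[(i, v, y)] += 1

    def predict(self, x):
        total = sum(self.class_counts.values())
        best, best_score = None, -np.inf
        for c, nc in self.class_counts.items():
            score = np.log(nc / total)
            for i, v in enumerate(x):
                # Laplace-smoothed estimate of P(X_i = v | C = c), weighted by w_i.
                p = (self.cond_counts[(i, v, c)] + 1) / (nc + 2)
                score += self.w[i] * np.log(p)
            if score > best_score:
                best, best_score = c, score
        return best

# Tiny illustrative stream of already-discretised examples.
clf = IncrementalWeightedNB(n_vars=2, weights=[1.0, 0.5])
for x, y in [((0, 1), "a"), ((0, 0), "a"), ((1, 1), "b"), ((1, 0), "b")]:
    clf.partial_fit(x, y)
print(clf.predict((1, 1)))
```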

Christophe Salperwyck, Vincent Lemaire, Carine Hue
SVM Ensembles Are Better When Different Kernel Types Are Combined

Support vector machines (SVM) are strong classifiers, but large datasets might lead to prohibitively long computation times and high memory requirements. SVM ensembles, where each single SVM sees only a fraction of the data, can be an approach to overcome this barrier. In continuation of related work in this field we construct SVM ensembles with Bagging and Boosting. As a new idea we analyze SVM ensembles with different kernel types (linear, polynomial, RBF) involved inside the ensemble. The goal is to train one strong SVM ensemble classifier for large datasets with less time and memory requirements than a single SVM on all data. From our experiments we find evidence for the following facts: Combining different kernel types can lead to an ensemble classifier stronger than each individual SVM on all training data and stronger than ensembles from a single kernel type alone. Boosting is only productive if we make each single SVM sufficiently weak, otherwise we observe overfitting. Even for very small training sample sizes—and thus greatly reduced time and memory requirements—the ensemble approach often delivers accuracies similar or close to a single SVM trained on all data.
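
A minimal sketch of the ensemble idea, assuming scikit-learn and synthetic data (not the authors' experimental setup): each SVM member is trained on a small bootstrap sample, the kernel type is rotated among linear, polynomial and RBF, and the majority vote is compared with a single SVM trained on all data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
kernels = ["linear", "poly", "rbf"]
members = []
for b in range(15):
    idx = rng.choice(len(X_tr), size=300, replace=True)   # small bootstrap sample
    members.append(SVC(kernel=kernels[b % 3]).fit(X_tr[idx], y_tr[idx]))

# Majority vote over the ensemble members (binary 0/1 labels).
votes = np.stack([m.predict(X_te) for m in members])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble accuracy:", accuracy_score(y_te, y_hat))
print("single full SVM  :", accuracy_score(y_te, SVC().fit(X_tr, y_tr).predict(X_te)))
```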

Jörg Stork, Ricardo Ramos, Patrick Koch, Wolfgang Konen

Data Analysis in Marketing

Frontmatter
Ratings-/Rankings-Based Versus Choice-Based Conjoint Analysis for Predicting Choices

Nowadays, for market simulation in consumer markets with multi-attributed products, choice-based conjoint analysis (CBC) is most popular. The popularity stems, on the one hand, from the possibility of using online panels for affordable data collection and, on the other hand, from the possibility of estimating part worths at the respondent level using only a few observations. However, a still open question is whether this money- and time-saving approach provides the same or even better results than ratings-/rankings-based alternatives. An experiment with 787 students from Poland and Germany is used to answer this question: cola preferences are measured using CBC as well as ratings-/rankings-based alternatives. The results are compared using the multitrait-multimethod matrix for the estimated part worths and first-choice hit rates for holdout choice sets. The experiment shows a superiority of CBC, but also important differences between Polish and German cola consumers that outweigh the methodological differences.

Daniel Baier, Marcin Pełka, Aneta Rybicka, Stefanie Schreiber
A Statistical Software Package for Image Data Analysis in Marketing

The strongly growing number of available images reveals a great opportunity for a new age in the field of statistical analysis. Several thousand digital images are taken and published every day but not used for marketing purposes. Common statistical tools like SPSS, SAS, R, MATLAB, or RapidMiner still provide no or only insufficient image processing packages. In this paper we introduce IMADAC, a statistical software package extending Naundorf et al. (Computer science reports. Institute of Computer Science, Brandenburg University of Technology, Cottbus, 2012) and Zellhöfer et al. (Proceedings of the 2nd ACM international conference on multimedia retrieval, ICMR '12, pp. 59–60, 2012). IMADAC, designed for experts as well as users without an image processing background, combines statistical analysis of common statistical data (e.g., age or gender) with image processing methods. This paper demonstrates the usage of low-level image features for statistical purposes (e.g., clustering or multi-dimensional scaling). To improve marketing analysis results, we further show how to combine image features with other statistical data and how this can be done in a graphical user interface (GUI).

Thomas Böttcher, Daniel Baier, Robert Naundorf
The Bass Model as Integrative Diffusion Model: A Comparison of Parameter Influences

New technologies are permanently developed and introduced into markets. Although their adoption process is extremely volatile and varies from case to case, it is of extreme interest to companies to somehow plan and especially to estimate the development. For these estimations so-called diffusion models are utilized. A well-known and often used one is the Bass model, which incorporates different parameters and their specific influences. Our paper analyzes what kind of parameters (e.g., coefficient of innovation, underlying distribution) have what kind of influence (e.g., number of adoptions, standard deviation from adoption time) on the diffusion estimations. For the analysis the market of electric vehicles with its politically motivated objectives and current sales quantities serves as an application example. For the analysis itself, a factorial design with synthetically generated and disturbed data is applied.
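
For reference, the standard Bass model expresses the cumulative adoption fraction as F(t) = (1 - exp(-(p+q)t)) / (1 + (q/p) exp(-(p+q)t)), where p is the coefficient of innovation, q the coefficient of imitation and m the market potential. The sketch below evaluates this formula for illustrative parameter values only (not estimates for the electric-vehicle market, and not the paper's factorial design).

```python
import numpy as np

def bass_adoptions(m, p, q, periods):
    """Cumulative and per-period adoptions under the Bass diffusion model."""
    t = np.arange(1, periods + 1)
    F = (1 - np.exp(-(p + q) * t)) / (1 + (q / p) * np.exp(-(p + q) * t))
    cumulative = m * F
    per_period = np.diff(np.concatenate(([0.0], cumulative)))
    return cumulative, per_period

# Illustrative parameter values only.
cum, per = bass_adoptions(m=100_000, p=0.03, q=0.38, periods=10)
print(per.round(0))
```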

Michael Brusch, Sebastian Fischer, Stephan Szuppa
Preference Measurement in Complex Product Development: A Comparison of Two-Staged SEM Approaches

For many years, preference measurement has been used to understand the importance that customers ascribe to alternative possible product attribute levels. Available for this purpose are, e.g., compositional approaches based on the self-explicated model (SEM) as well as decompositional ones based on conjoint analysis (CA). Typically, in SEM approaches customers evaluate the importance of product attributes one by one, whereas in decompositional approaches they evaluate possible alternative products (attribute-level combinations) and the importances are then derived. The SEM approaches seem to be superior when products are complex and the number of attributes is high. However, there is still room for improvement. In this paper two innovative two-staged SEM approaches are proposed and tested. The complex products under study are small remotely piloted aircraft systems (small RPAS) for German search and rescue (SAR) forces.

Jörgen Eimecke, Daniel Baier
Combination of Distances and Image Features for Clustering Image Data Bases

Millions of pictures are released online daily, but it is hard to analyze them automatically for marketing purposes. This paper shows how methods from content-based image retrieval can be used to classify image data and make them usable for marketing applications. A number of different image features can be extracted from the images in order to calculate dissimilarities between them afterwards with different kinds of distance measures (Manjunath et al. 2001). We focus especially on mass-transportation problems, like the Earth Mover's Distance (EMD) (Rubner et al., Int J Comput Vis 40(2):99–121, 2000), because they fit human perception of dissimilarities. Furthermore, some studies already show that they are robust to disturbances like changes in resolution, contrast, or noise (Frost and Baier, Algorithms from and for nature and life. Studies in classification, data analysis, and knowledge organization, vol 45. Springer, Heidelberg, 2013). We compare some approximations of the EMD (e.g., Pele and Werman 2009) with an approximation algorithm developed by ourselves. The aim is to find a combination of features and distances which allows clustering large image databases in a way that matches human perception.
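
A minimal sketch of the overall idea, assuming SciPy and hypothetical one-dimensional colour histograms (the paper works with richer image signatures and EMD approximations): pairwise Earth Mover's Distances, which are exact in the 1-D case, are computed and then fed into hierarchical clustering.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 1-D hue histograms for a handful of images (rows sum to 1).
rng = np.random.default_rng(0)
hists = rng.dirichlet(np.ones(32), size=6)
bins = np.arange(32)

# Pairwise Earth Mover's Distances between the histograms.
n = len(hists)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = wasserstein_distance(bins, bins, hists[i], hists[j])

# Cluster the images from the EMD matrix (average linkage on precomputed distances).
labels = fcluster(linkage(squareform(D), method="average"), t=2, criterion="maxclust")
print(labels)
```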

Sarah Frost, Daniel Baier
A Game Theoretic Product Design Approach Considering Stochastic Partworth Functions

Developing new products is a necessary but costly and risky adventure. Therefore, the customers' point of view and the prospective competitive environment have to be considered. Here, conjoint analysis has proven to be helpful since this preference modeling approach can be used to predict market shares [see, e.g., Baier and Gaul (J Econ 89(1–2):365–392, 1999), Baier and Gaul (Conjoint measurement: methods and applications. Springer, Berlin, pp. 47–66, 2007)]. When, additionally, competitive reactions must be considered, game theoretic approaches are a helpful extension [see, e.g., Choi and Desarbo (Market Lett 4(4):337–348, 1993), Steiner and Hruschka (OR Spektrum 22:71–95, 2000), Steiner (OR Spectrum 32:21–48, 2010)]. However, new Bayesian procedures have recently been developed for conjoint analysis that allow modeling customers' partworth functions in a stochastic fashion. The idea is that customers have different preferences over time. In this paper we propose a new game theoretic approach that takes this new aspect into account. The new approach is applied to a (fictitious) product design setting, and a comparison to a traditional approach is presented.

Daniel Krausche, Daniel Baier
Key Success-Determinants of Crowdfunded Projects: An Exploratory Analysis

Crowdfunding, a process with which enterprises or individuals seek to secure project funding, has received much attention recently, not only from the media. The boost in visibility provided to crowdfunding by Internet platforms has made securing project funding, by soliciting pledges from potential donors, simpler than ever. A popular way of allocating funding, and thus bypassing traditional venture capital providers, is to set a reserve pledge sum. If this pledge sum is achieved, the promised pledges are collected from the project supporters. Upon project completion, these pledgers receive a compensation, which is usually non-monetary and based on the magnitude of their contribution. Projects funded in this way cover a wide variety of topics, ranging from hardware manufacturing to fine arts and even disaster relief. This study investigates possible key success factors for attaining the reserve pledge sum. To this end, data on 45,400 crowdfunding campaigns were collected and key success factors were analyzed using a logistic regression. The results indicate that communication and professionalism have a high impact on funding success, and that communication measures such as having a unique website set a minimum standard. Further conclusions allow practitioners to positively influence the campaign outcome and researchers to build upon the results of this study.

Thomas Müllerleile, Dieter William Joenssen
Preferences Interdependence Among Family Members: Case III/APIM Approach

The purpose of the paper is to identify preference structures in the framework of the actor-partner interdependence model (APIM) based on paired-comparison or ranking data (Thurstone Case III/V model). The households' preferences for allocating income between consumption, savings and investments are considered. The preference structures among the families are then identified on the basis of the Thurstonian Case III preference model. Latent preferences are used to model the actor-partner interdependencies between the household members.

Adam Sagan

Data Analysis in Biostatistics and Bioinformatics

Frontmatter
Evaluation of Cell Line Suitability for Disease Specific Perturbation Experiments

Cell lines are widely used in translational biomedical research to study the genetic basis of diseases. A major approach to experimental disease modeling is genetic perturbation experiments that aim to trigger selected cellular disease states. In this type of experiment it is crucial to ensure that the targeted disease-related genes and pathways are intact in the cell line used. In this work we develop a framework which integrates genetic sequence information and disease-specific network analysis for evaluating disease-specific cell line suitability.

Maria Biryukov, Paul Antony, Abhimanyu Krishna, Patrick May, Christophe Trefois
Effect of Hundreds of Sequenced Genomes on the Classification of Human Papillomaviruses

The classification of the hundreds of papillomaviruses (PVs) still constitutes a major issue in virology, disease diagnosis, and therapy. Since 2003, PVs have been classified within three levels of hierarchical clusters according to their similarity and their position in the phylogenetic tree, using the DNA sequence of the L1 gene. With the increased number of sequenced genomes, the boundaries of the different clusters within the different levels might overlap and the topology of the associated tree could change, thus preventing a unique and coherent classification. Here, we study the classification of 560 currently available human PVs (HPV) with respect to the criteria established 10 years ago as well as novel ones. The results highlight that the current taxonomic identification fits the monophyletic criterion for the L1 gene, but the sequence similarity criterion violates the established boundaries used to classify PVs. Finally, we argue that substituting whole-genome similarity for L1 gene similarity would allow less overlap between the different clusters and provide a better classification.

Bruno Daigle, Vladimir Makarenkov, Abdoulaye Baniré Diallo
Donor Limited Hot Deck Imputation: A Constrained Optimization Problem

Hot deck methods impute missing data by matching records that are complete to those that are missing values. Observations absent within the recipient are then replaced by replicating the values from the matched donor. Some hot deck procedures constrain the frequency with which any donor may be matched to increase the precision of post-imputation parameter-estimates. This constraint, called a donor limit, also mitigates risks of exclusively using one donor for all imputations or using one donor with an extreme value or values “too often.” Despite these desirable properties, imputation results of a donor limited hot deck are dependent on the recipients’ order of imputation, an undesirable property. For nearest neighbor type hot deck procedures, the implementation of a constraint on donor usage causes the stepwise matching between each recipient and its closest donor to no longer minimize the sum of all donor–recipient distances. Thus, imputation results may further be improved by procedures that minimize the total donor–recipient distance-sum. The discrete optimization problem is formulated and a simulation detailing possible improvements when solving this integer program is presented.
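
A minimal sketch of the constrained matching idea, assuming SciPy and hypothetical toy data (not the paper's integer program or simulation): replicating each donor up to the donor limit and solving the resulting assignment problem minimises the total donor-recipient distance instead of matching recipients greedily one after another.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def donor_limited_hot_deck(recipients, donors, donor_limit):
    """Match every recipient to a donor, with each donor used at most
    `donor_limit` times, minimising the total recipient-donor distance."""
    # Distance between each recipient and each donor on the matching variables.
    dist = np.linalg.norm(recipients[:, None, :] - donors[None, :, :], axis=2)
    # Replicate each donor column `donor_limit` times and solve the assignment problem.
    cost = np.repeat(dist, donor_limit, axis=1)
    rows, cols = linear_sum_assignment(cost)
    donor_index = cols // donor_limit
    return donor_index[np.argsort(rows)]

# Toy example: 4 recipients, 3 donors, each donor usable at most twice.
rng = np.random.default_rng(0)
recipients = rng.normal(size=(4, 2))
donors = rng.normal(size=(3, 2))
print(donor_limited_hot_deck(recipients, donors, donor_limit=2))
```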

Dieter William Joenssen
Ensembles of Representative Prototype Sets for Classification and Data Set Analysis

The drawback of many state-of-the-art classifiers is that their models are not easily interpretable. We recently introduced Representative Prototype Sets (RPS), which are simple base classifiers that allow for a systematic description of data sets by exhaustive enumeration of all possible classifiers.

The major focus of the previous work was on a descriptive characterization of low-cardinality data sets. In the context of prediction, a lack of accuracy of the simple RPS model can be compensated by accumulating the decisions of several classifiers. Here, we now investigate ensembles of RPS base classifiers in a predictive setting on data sets of high dimensionality and low cardinality. The performance of several selection and fusion strategies is evaluated. We visualize the decisions of the ensembles in an exemplary scenario and illustrate links between visual data set inspection and prediction.

Christoph Müssel, Ludwig Lausser, Hans A. Kestler
Event Prediction in Pharyngeal High-Resolution Manometry

A prolonged phase of increased pressure in the upper esophageal sphincter (UES) after swallowing might result in globus sensation. Therefore, it is important to evaluate restitution times of the UES in order to distinguish physiologic from impaired swallow-associated activities. Estimating the event $t^{\star}$ where the UES has returned to its resting pressure after swallowing can be accomplished by predicting whether swallowing activities are present or not. While the problem of whether a certain swallow is pathologic or not is approached in Mielens (J Speech Lang Hear Res 55:892–902, 2012), the analysis conducted in this paper advances the understanding of normal pharyngoesophageal activities.

From the machine learning perspective, the problem is treated as binary sequence labeling, aiming to find a sample $t^{\star}$ within the sequence obeying a certain characteristic: we strive for a best approximation of the label transition, which can be understood as a dissection of the sequence into individual parts. Whereas common models for sequence labeling are based on graphical models (Nguyen and Guo, Proceedings of the 24th International Conference on Machine Learning. ACM, New York, pp. 681–688, 2007), we approach the problem using logistic regression as the classifier, integrating sequential features by means of FFT coefficients and a Laplacian regularizer in order to encourage a smooth classification owing to the monotonicity of the target labels.
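
A minimal sketch of the classification step, assuming scikit-learn and a synthetic pressure trace, and omitting the Laplacian regularizer described in the abstract: windowed FFT magnitudes serve as features for a logistic regression that labels each sample as swallow activity or rest, from which a transition sample t* can be read off.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fft_window_features(signal, window=32, n_coef=8):
    """Slide a window over the signal and keep the magnitudes of the first FFT
    coefficients of each window as features for the window's starting sample."""
    feats = []
    for start in range(0, len(signal) - window + 1):
        spec = np.fft.rfft(signal[start:start + window])
        feats.append(np.abs(spec)[:n_coef])
    return np.array(feats)

# Synthetic stand-in for a manometry trace: an oscillatory "swallow" burst
# followed by a return to resting pressure (labels 1 during the burst).
t = np.arange(600)
rng = np.random.default_rng(0)
signal = np.where(t < 300, np.sin(t / 3.0) * np.exp(-t / 200.0), 0.0) + 0.05 * rng.normal(size=600)
labels = (t < 300).astype(int)

X = fft_window_features(signal)
y = labels[: len(X)]
clf = LogisticRegression(max_iter=1000).fit(X, y)
pred = clf.predict(X)
# The estimated transition sample t* is the last window start labeled as swallow activity.
print("estimated t*:", int(np.max(np.nonzero(pred)[0])))
```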

Nicolas Schilling, Andre Busche, Simone Miller, Michael Jungheim, Martin Ptok, Lars Schmidt-Thieme
Edge Selection in a Noisy Graph by Concept Analysis: Application to a Genomic Network

MicroRNAs (miRNAs) are small RNA molecules that bind messenger RNAs (mRNAs) to silence their expression. Understanding this regulation mechanism requires the study of the miRNA/mRNA interaction network. State-of-the-art methods for predicting interactions lead to a high level of false positives: the interaction score distribution may be roughly described as a mixture of two overlapping Gaussian laws that need to be discriminated with a threshold. In order to further improve the discrimination between true and false interactions, we present a method that considers the structure of the underlying graph. We assume that the graph is formed on a relatively simple structure of formal concepts (associated with regulation modules in the regulation mechanism). Specifically, the formal context topology of true edges is assumed to be less complex than that of a noisy graph including spurious or missing interactions. Our approach thus consists in selecting edges below an edge score threshold and applying a repair process on the graph, adding or deleting edges to decrease the global concept complexity. To validate our hypothesis and method, we have extracted parameters from a real biological miRNA/mRNA network and used them to build random networks with fixed concept topology and true/false interaction ratio. Each repaired network can be evaluated with a score balancing the number of edge changes and the conceptual adequacy, in the spirit of the minimum description length principle.

Valentin Wucher, Denis Tagu, Jacques Nicolas

Data Analysis in Education and Psychology

Frontmatter
Linear Modelling of Differences in Teacher Judgment Formation of School Tracking Recommendations

The present paper investigates the application of two regression-based approaches, individual multiple regression and hierarchical linear modelling, in modelling differences in the judgment formation of primary school teachers' secondary school track recommendations. Both approaches share the same theoretical framework of judgment formation as weighted linear information integration, but differ in their capacity to take differences in judgment formation into account. First, both approaches were applied to empirical data on teachers' track recommendations and led to deviating conclusions on differences in judgment formation. To investigate which approach results in a more reliable representation of actual differences in judgment formation, both approaches were compared on simulated data, where hierarchical linear modelling performed slightly more accurately than individual regression. Thus, hierarchical linear modelling might be considered the preferable modelling approach in research on judgments on school tracking recommendations.

Thomas Hörstermann, Sabine Krolak-Schwerdt
Psychometric Challenges in Modeling Scientific Problem-Solving Competency: An Item Response Theory Approach

The ability to solve complex problems is one of the key competencies in science. In previous research, modeling scientific problem solving has mainly focused on the dimensionality of the construct, but rarely addressed psychometric test characteristics such as local item dependencies which could occur, especially in computer-based assessments. The present study consequently aims to model scientific problem solving by taking into account four components of the construct and dependencies among items within these components. Based on a data set of 1,487 German high-school students of different grade levels, who worked on computer-based assessments of problem solving, local item dependencies were quantified by using testlet models and Q3 statistics. The results revealed that a model differentiating testlets of cognitive processes and virtual systems fitted the data best and remained invariant across grades.

Ronny Scherer
The Luxembourg Teacher Databank 1845–1939. Academic Research into the Social History of the Luxembourg Primary School Teaching Staff

From 1845 to 1939 the pedagogical journal Der Luxemburger Schulbote published a comprehensive annual directory of the primary school teaching staff of the Grand Duchy. On the basis of this directory, we have established a databank encompassing 75,000 entries relating to a total of approx. 4,700 primary school teachers, both male and female, who taught in the Grand Duchy during this period. With the assistance of IBM SPSS Statistics, we have been able to process the data and compile a collective biography or prosopography that provides a profound insight into the development of an occupational group over a period of nearly 100 years at a local, regional and national level. This paper presents an analysis of initial research findings relating to the number of teaching staff, length of service and the level of qualification and mobility among teaching staff for the first half of this period from 1845 to 1895.

Peter Voss, Etienne Le Bihan

Data Analysis in Musicology

Frontmatter
Correspondence Analysis, Cross-Autocorrelation and Clustering in Polyphonic Music

This paper proposes to represent symbolic polyphonic musical data as contingency tables based upon the duration of each pitch for each time interval. Exploratory data analytic methods involve weighted multidimensional scaling, correspondence analysis, hierarchical clustering, and general autocorrelation indices constructed from temporal neighborhoods. Beyond the analysis of single polyphonic musical scores, the methods sustain inter-voice as well as inter-score comparisons, through the introduction of ad hoc measures of configuration similarity and cross-autocorrelation. Rich musical patterns emerge in the related applications, and preliminary results are encouraging for clustering tasks.

Christelle Cocco, François Bavaud
Impact of Frame Size and Instrumentation on Chroma-Based Automatic Chord Recognition

This paper presents a comparative study of classification performance in automatic audio chord recognition based on three chroma feature implementations, with the aim of distinguishing effects of frame size, instrumentation, and choice of chroma feature. Until recently, research in automatic chord recognition has focused on the development of complete systems. While results have remarkably improved, the understanding of the error sources remains lacking. In order to isolate sources of chord recognition error, we create a corpus of artificial instrument mixtures and investigate (a) the influence of different chroma frame sizes and (b) the impact of instrumentation and pitch height. We show that recognition performance is significantly affected not only by the method used, but also by the nature of the audio input. We compare these results to those obtained from a corpus of more than 200 real-world pop songs from The Beatles and other artists for the case in which chord boundaries are known in advance.

Daniel Stoller, Matthias Mauch, Igor Vatolkin, Claus Weihs
Interpretable Music Categorisation Based on Fuzzy Rules and High-Level Audio Features

Music classification helps to manage song collections, recommend new music, or understand properties of genres and substyles. Until now, the corresponding approaches have mostly been based on less interpretable low-level characteristics of the audio signal, or on metadata, which are not always available and require high effort to filter out the relevant information. A listener-friendly approach may rather benefit from high-level and meaningful characteristics. Therefore, we have designed a set of high-level audio features which is capable of replacing the baseline low-level feature set without a significant decrease in classification performance. However, many common classification methods change the original feature dimensions or create complex models with lower interpretability. The advantage of fuzzy classification is that it describes the properties of music categories in an intuitive, natural way. In this work, we explore the ability of a simple fuzzy classifier based on high-level features to predict six music genres and eight styles from our previous studies.

Igor Vatolkin, Günter Rudolph

Data Analysis in Communication and Technology

Frontmatter
What Is in a Like? Preference Aggregation on the Social Web

The Social Web is dominated by rating systems such as those of Facebook (only "Like"), YouTube (both "Like" and "Dislike"), or the Amazon product review 5-star rating. All these systems try to answer the question: how should a social application pool the preferences of different agents so as to best reflect the wishes of the population as a whole? The main framework is the theory of social choice (Arrow, Social choice and individual values, Wiley, New York, 1963; Fishburn, The theory of social choice, Princeton University Press, Princeton, 1973), i.e., agents have preferences and do not try to camouflage them in order to manipulate the outcome to their personal advantage (moreover, manipulation is quite difficult when interactions take place at Web scale). Our approach uses a combination of the Like/Dislike system and a 5-star satisfaction system to achieve local preference ranks and a global partial ranking on the outcome set. Moreover, the actual data collection can support other preference learning techniques such as the ones introduced by Baier and Gaul (J. Econ. 89:365–392, 1999), Cohen et al. (J. Artif. Intel. Res. 10:213–270, 1999), Fürnkranz and Hüllermeier (Künstliche Intelligenz 19(1):60–61, 2005), and Hüllermeier et al. (Artif. Intel. 172(16–17):1897–1916, 2008).

Adrian Giurca, Daniel Baier, Ingo Schmitt
Predicting Micro-Level Behavior in Online Communities for Risk Management

Online communities amass vast quantities of valuable knowledge and thus generate major value for their owners. Where these communities are incorporated in a business as the main means of sharing ideas and issues regarding products produced by the business, it is important that the value of this knowledge endures and is easily recognized. For good management of such a business, risk analysis of the integrated online community is required. We choose to focus on the process of knowledge creation rather than the knowledge gained from individual messages isolated from context. Consequently, we model collections of messages, linked via tree-like structures; these message collections we call threads. Here we suggest a risk framework aimed at managing micro-level thread-related risks. Specifically, we target the risk that there is no satisfactory response to the original message after a period of time. Risks are considered as binary events; the event can therefore be flagged for the attention of the community manager when it is predicted to occur. To predict such a binary response, we use several methods, including a Bayesian probit regression estimated via Gibbs sampling; results indicate this model to be suitable for classification tasks such as those considered.

Philippa A. Hiscock, Athanassios N. Avramidis, Jörg Fliege
Human Performance Profiling While Driving a Sidestick-Controlled Car

We have established a metric for measuring human performance while operating a sidestick-controlled car and have used it in conjunction with a known environment type to identify unusual steering trends. We focused on the analysis of the vehicle's offset from the lane center in the time domain and identified a set of this signal's features shared by all test drivers. The distribution of these features identifies a specific driving environment type and represents the essence of the proposed metric. We assumed that driver performance, while operating a sidestick-controlled car, is determined by the environment type on the one hand and the driver's own mental state on the other. The goal is to detect a mismatch between the assumed driving environment, inferred from the introduced metric, and the ground truth about the actual environment type, which can be obtained through map and GPS data, in order to identify unusual steering trends possibly caused by a change in driver fitness.

Ljubo Mercep, Gernot Spiegelberg, Alois Knoll
Multivariate Landing Page Optimization Using Hierarchical Bayes Choice-Based Conjoint

Landing pages are defined to be the home page of a website (e.g., an online shop) or a specific webpage that appears in response to an ad. Their design plays an important role in decreasing the number of visitors leaving the website without any activity (e.g., clicking a banner, purchasing a product). For improving landing pages, the traditional A/B testing approach offers a simple but limited solution to evaluate two different variants. However, new approaches have recently been introduced: webpages with multiple variations of website elements (e.g., navigation menu, advertising banners) generated through experimental designs are rated by customers (Gofman et al., J. Consum. Mark. 26(4):286–298, 2009). The paper explores a new approach for multivariate landing page optimization using hierarchical Bayes choice-based conjoint analysis (CBC/HB) that combines the potential to test a large number of variants with a short survey. The new approach is discussed and applied to improve the online shop of a popular German Internet pharmacy. Choice data are collected from a large sample of customers. From the results an optimal landing page is derived and implemented.

Stefanie Schreiber, Daniel Baier
Distance Based Feature Construction in a Setting of Astronomy

The MAGIC and FACT telescopes on the Canary Island of La Palma are both imaging Cherenkov telescopes. Their purpose is to detect highly energetic gamma particles sent out by various astrophysical sources. Due to characteristics of the detection process not only gamma particles are recorded, but also other particles summarized as hadrons. For further analysis the gamma ray signal has to be separated from the hadronic background. So far, so-called Hillas parameters (Hillas, Proceedings of the 19th International Cosmic Ray Conference ICRC, San Diego, 1985) are used as features in a classification algorithm for the separation. These parameters are only a first heuristic approach to describe signal events, so that it is desirable to find better features for the classification. We construct new features by using distance measures between the observed Cherenkov light distribution in the telescope camera and an idealized model distribution for the signal events, which we deduce from simulations and which takes, for example, the alignment and shape of an event into account. The new features added to the Hillas parameters lead to substantial gains in terms of classification.

Tobias Voigt, Roland Fried

Data Analysis in Administration and Spatial Planning

Frontmatter
Hough Transform and Kirchhoff Migration for Supervised GPR Data Analysis

Ground penetrating radar (GPR) is a widely used technology for detecting buried objects in the subsoil. Radar measurements are usually depicted as radargram images, including distorted hyperbola-like shapes representing pipes running non-parallel to the measurement trace. Also because of the heterogeneity of the subsoil, human experts usually analyse radargrams only in a semi-automatic way by adjusting parameters of the detection models (exposed by the software used) to get the best detection results. To obtain a set of approximate hyperbola apex positions, unsupervised methods such as the Hough transform (HT) or Kirchhoff migration are often used. Having high-quality, large-scale real-world measurement data collected on a specialized test site at hand, we both (a) analyse differences and similarities of the HT and Kirchhoff migration quantitatively and analytically with respect to different preprocessing techniques, and (b) embed results from either technique into a supervised framework. The primary contribution of this paper is an exhaustive experiment, not only showing their equivalence, but also showing that their application to the automated analysis of GPR data, contrary to what is currently assumed, does not improve the detection performance significantly.

Andre Busche, Daniel Seyfried, Lars Schmidt-Thieme
Application of Hedonic Methods in Modelling Real Estate Prices in Poland

This paper concentrates on empirical and methodological issues of the application of econometric methods to modelling the real estate market. The presented hedonic analysis of apartment prices in Wrocław is based on a dataset consisting of over ten thousand offers from the secondary real estate market. The models estimated as the result of the research allow for pricing the apartments as well as their characteristics. The foundations of hedonic methods are formed by the so-called hedonic hypothesis, which states that heterogeneous commodities are characterized by a set of attributes relevant both from the point of view of the customer and the producer. As a consequence, the price of a commodity is determined as an aggregate of the values estimated for each significant characteristic of this commodity. The hedonic model allows pricing the commodity as well as identifying and estimating the prices of the respective attributes, including prices which are not directly observable on the market. The latter is particularly useful for the real estate market as it enables pricing location-related, neighbourhood-related and structure-related characteristics of housing whose values cannot be obtained otherwise.

Anna Król
Smart Growth Path as the Basis for the European Union Countries Typology

The concept of smart growth integrates activities in the areas of smart specialization, creativity and innovation that influence the development opportunities of particular European countries. The objective of the paper is to classify the EU countries with regard to smart growth paths by means of multivariate statistical analysis methods. The concept of a smart growth path was defined considering the direction and intensity of changes occurring in the areas of smart specialization, creativity and innovation. These paths became the basis for the classification of the European Union member states carried out using cluster analysis methods. The presented analysis is of a dynamic nature and allows for a typology of smart growth patterns.

Elżbieta Sobczak, Beata Bal-Domańska
The Influence of Upper Level NUTS on Lower Level Classification of EU Regions

The Nomenclature of Territorial Units for Statistics, or Nomenclature of Units for Territorial Statistics (NUTS), is a geocode standard for referencing the subdivisions of countries for statistical purposes. It covers the member states of the European Union. For each EU member country, a hierarchy of three levels is established by Eurostat. In the 27 EU countries there are 97 regions at NUTS 1, 271 regions at NUTS 2 and 1,303 regions at NUTS 3. They are the subject of many statistical analyses involving clustering methods. Having a partition of units on a given level, we can ask whether this partition has been influenced by the upper-level division of Europe. For example, after finding homogeneous groups of NUTS 2 regions we would like to know if the partition has been influenced by differences between countries. In the paper we propose a procedure for testing the statistical significance of the influence of upper-level units on a given partition. If there is no such influence, we can expect the number of between-group borders which are also country borders to follow a certain probability distribution. A simulation procedure for finding this distribution and its critical values for significance testing is proposed in this paper. The real data analysis shown as an example deals with the innovativeness of German districts and the influence of government regions on innovation processes.
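
A minimal sketch of such a simulation procedure, on hypothetical toy data and not necessarily with the exact statistic or simulation scheme used in the paper: the share of between-group borders that coincide with country borders is compared with its distribution under random reallocation of the cluster labels.

```python
import numpy as np

def border_statistic(adjacency, clusters, countries):
    """Proportion of between-group borders (neighbouring regions in different
    clusters) that are also country borders."""
    i, j = np.triu_indices_from(adjacency, k=1)
    between = (adjacency[i, j] > 0) & (clusters[i] != clusters[j])
    if between.sum() == 0:
        return 0.0
    return float(np.mean(countries[i][between] != countries[j][between]))

def simulated_pvalue(adjacency, clusters, countries, n_sim=2000, seed=0):
    """Null distribution obtained by reshuffling the cluster labels over the regions,
    which breaks any alignment between the partition and the country division."""
    rng = np.random.default_rng(seed)
    obs = border_statistic(adjacency, clusters, countries)
    sims = np.array([border_statistic(adjacency, rng.permutation(clusters), countries)
                     for _ in range(n_sim)])
    return obs, float(np.mean(sims >= obs))

# Toy example: 10 regions on a chain, two countries of five regions each,
# and a two-group partition that exactly follows the country border.
A = np.zeros((10, 10), int)
for k in range(9):
    A[k, k + 1] = A[k + 1, k] = 1
countries = np.repeat([0, 1], 5)
clusters = np.repeat([0, 1], 5)
print(simulated_pvalue(A, clusters, countries))   # high observed proportion, small p-value
```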

Andrzej Sokołowski, Małgorzata Markowska, Danuta Strahl, Marek Sobolewski

Data Analysis in Library Science

Frontmatter
Multilingual Subject Retrieval: Bibliotheca Alexandrina’s Subject Authority File and Linked Subject Data

The Bibliotheca Alexandrina (BA) has been developing its own authority file since September 2006. The file includes subject headings, personal names, corporate bodies, series, and uniform titles. The BA authority file is unique in that it is constructed on the actual collection and it establishes subject headings in the language of the item described. The focus of this research is on the subject component of the BA authority file, which includes subject headings in Latin and Arabic scripts. This study will describe the workflow involved in creating the subject heading authority file and the linked data process involved in linking subject terms in Arabic, English, and French.

Magda El-Sherbini
The VuFind Based “MT-Katalog”: A Customized Music Library Service at the University of Music and Drama Leipzig

For some time, large academic libraries have been offering discovery systems in order to allow access not only to their library holdings but also to their licensed electronic materials. Most of these libraries integrate huge commercial indices into their discovery systems. But it is only now that special libraries are starting to discuss whether those indices meet their demands, too.

As part of a cooperative project, the library of the University of Music and Drama in Leipzig, Germany, installed the open source system VuFind (see http://www.vufind.org). This was accompanied by discussions about how to develop a discovery system that is transparent and suited to the users' needs.

The following paper shows the reflections on that matter that finally led to the new MT-Katalog, which offers greater ease of use and a broader search scope than our previous catalogue. We will take into account which additional musical and music-related e-resources can be found, selected and integrated. The paper will also provide ideas for improved use and enhancement of metadata.

Anke Hofmann, Barbara Wiermann
Backmatter
Metadata
Title
Data Science, Learning by Latent Structures, and Knowledge Discovery
Edited by
Berthold Lausen
Sabine Krolak-Schwerdt
Matthias Böhmer
Copyright year
2015
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-662-44983-7
Print ISBN
978-3-662-44982-0
DOI
https://doi.org/10.1007/978-3-662-44983-7