
2014 | Book

German-Japanese Interchange of Data Analysis Results


About this Book

This volume focuses on innovative approaches and recent developments in clustering, the analysis of data and models, and applications. The first part of the book covers a broad range of innovations in the area of clustering, from algorithmic innovations for graph clustering to new visualization and evaluation techniques. The second part addresses new developments in data and decision analysis (conjoint analysis, non-additive utility functions, analysis of asymmetric relationships, and regularization techniques). The third part is devoted to the application of innovative data analysis methods in the life sciences, the social sciences, and engineering. All contributions in this volume are revised and extended versions of selected papers presented at the German-Japanese workshops in Karlsruhe (2010) and Kyoto (2012).

Table of Contents

Frontmatter

Clustering

Frontmatter
Model-Based Clustering Methods for Time Series
Abstract
This paper considers the problem of clustering n observed time series \(\mathbf{x}_{k} =\{\ x_{k}(t)\ \vert \ t \in \mathcal{T}\}\), \(k = 1,\ldots,n\), with time points t in a suitable time range \(\mathcal{T}\), into a suitable number m of clusters \(C_{1},\ldots,C_{m} \subset \{ 1,\ldots,n\}\), each comprising time series with a ‘similar’ structure. Classical approaches typically proceed by first computing a dissimilarity matrix and then applying a traditional, possibly hierarchical, clustering method. In contrast, we present a brief survey of various approaches that start by defining probabilistic clustering models for the time series, i.e., class-specific distribution models, and then determine a suitable (hopefully optimal) clustering by statistical tools such as maximum likelihood and optimization algorithms. In particular, we consider models with class-specific Gaussian processes and Markov chains.
Hans-Hermann Bock
The Randomized Greedy Modularity Clustering Algorithm and the Core Groups Graph Clustering Scheme
Abstract
The modularity measure of Newman and Girvan is a popular formal cluster criterion for graph clustering. Although the modularity maximization problem has been shown to be NP-hard, a large number of heuristic modularity maximization algorithms have been developed. In the 10th DIMACS Implementation Challenge of the Center for Discrete Mathematics & Theoretical Computer Science (DIMACS) on graph clustering, our core groups graph clustering scheme combined with a randomized greedy modularity clustering algorithm won both modularity optimization challenges: the Modularity Challenge (highest modularity) and the Pareto Challenge (trade-off between modularity and performance). The core groups graph clustering scheme is an ensemble learning clustering method which combines the local solutions of several base algorithms to form a good start solution for the final algorithm. The randomized greedy modularity algorithm is a non-deterministic agglomerative hierarchical clustering approach which finds locally optimal solutions. In this contribution we analyze the similarity of the randomized greedy modularity algorithm to incomplete solvers for the satisfiability problem, and we establish an analogy between the cluster core group heuristic used in core groups graph clustering and a sampling of restart points on the Morse graph of a continuous optimization problem with the same local optima.
Andreas Geyer-Schulz, Michael Ovelgönne
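As an illustrative sketch (not the authors' prize-winning implementation), the Newman–Girvan modularity that the randomized greedy algorithm maximizes can be computed for a given partition as follows, assuming a small undirected graph given as an edge list; function and variable names are our own:

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman-Girvan modularity Q of a partition of an undirected graph.

    edges: list of (u, v) pairs; community: dict mapping node -> community id.
    Q = sum over communities c of  e_c/m - (d_c/(2m))^2,
    where e_c is the number of intra-community edges and d_c the total degree.
    """
    m = len(edges)
    deg = defaultdict(int)        # node degrees
    intra = defaultdict(int)      # edges inside each community
    comm_deg = defaultdict(int)   # total degree per community
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    for node, d in deg.items():
        comm_deg[community[node]] += d
    q = 0.0
    for c in comm_deg:
        q += intra[c] / m - (comm_deg[c] / (2 * m)) ** 2
    return q
```

A randomized greedy maximizer on top of this would repeatedly merge the pair of communities yielding the largest modularity gain, choosing randomly among the top-ranked candidate merges.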
Comparison of Two Distribution Valued Dissimilarities and Its Application for Symbolic Clustering
Abstract
There is an increasing need to analyse the very large and complex datasets produced by modern high-performance computing devices and their application software. We need to aggregate and then analyze those datasets. Symbolic Data Analysis (SDA) was proposed by E. Diday in the 1980s (Billard L, Diday E (2007) Symbolic data analysis. Wiley, Chichester), mainly targeting large-scale complex datasets. There is a large body of SDA research on interval-valued and histogram-valued data. More recently, distribution-valued data have become increasingly important (e.g. Diday E, Vrac M (2005) Mixture decomposition of distributions by copulas in the symbolic data analysis framework, vol 147. Elsevier Science Publishers B.V., Amsterdam, pp 27–41; Mizuta M, Minami H (2012) Analysis of distribution valued dissimilarity data. In: Gaul WA, Geyer-Schulz A, Schmidt-Thieme L, Kunze J (eds) Challenges at the interface of data analysis, computer science, and optimization. Studies in classification, data analysis, and knowledge organization. Springer, Berlin/Heidelberg, pp 23–28). In this paper, we focus on distribution-valued dissimilarity data and hierarchical cluster analysis. Cluster analysis plays a key role in data mining, knowledge discovery, and also in SDA. Conventional inputs of cluster analysis are real-valued data, but in some cases, e.g., after data aggregation, the inputs may be stochastic over ranges, i.e., distribution-valued dissimilarities. For hierarchical cluster analysis, an order relation on dissimilarities is necessary, i.e., the dissimilarities need to satisfy the properties of an ultrametric. However, distribution-valued dissimilarities have no natural order relation. We therefore develop a method for investigating the order relation of distribution-valued dissimilarities and apply it to hierarchical symbolic clustering. Finally, we demonstrate the use of our order relation for finding a hierarchical clustering of Japanese Internet sites according to Internet traffic data.
Yusuke Matsui, Yuriko Komiya, Hiroyuki Minami, Masahiro Mizuta
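One natural (partial) order on distribution-valued dissimilarities is first-order stochastic dominance between empirical samples; the check below is a minimal illustration of the ordering problem the paper addresses, not the authors' own order relation:

```python
import bisect

def stochastically_smaller(a, b):
    """First-order stochastic dominance between two samples of
    distribution-valued dissimilarities: a <= b iff the empirical CDF of a
    lies everywhere at or above that of b. Note this is only a partial
    order -- many pairs of distributions are incomparable."""
    xs = sorted(set(a) | set(b))     # all evaluation points that matter
    sa, sb = sorted(a), sorted(b)

    def cdf(sample, x):
        return bisect.bisect_right(sample, x) / len(sample)

    return all(cdf(sa, x) >= cdf(sb, x) for x in xs)
```

Because such orders are partial, a clustering method must decide how to handle incomparable pairs, which is exactly where a purpose-built order relation such as the one developed in this chapter comes in.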
Pairwise Data Clustering Accompanied by Validation and Visualisation
Abstract
Pairwise proximities are often the starting point for finding clusters by applying cluster analysis techniques. We refer to this approach as pairwise data clustering (Mucha HJ (2009) ClusCorr98 for Excel 2007: clustering, multivariate visualization, and validation. In: Mucha HJ, Ritter G (eds) Classification and clustering: models, software and applications. Report 26, WIAS, Berlin, pp 14–40). A well-known example is Gaussian model-based cluster analysis of observations in its simplest settings: the sum-of-squares and logarithmic sum-of-squares methods. These simple methods can be made more general by weighting the observations. By doing so, for instance, clustering the rows and columns of a contingency table can be performed based on pairwise chi-square distances. Finding the appropriate number of clusters is the ultimate aim of the proposed built-in validation techniques. They verify the results of the two most important families of methods: hierarchical and partitional clustering. Pairwise clustering should be accompanied by multivariate graphics such as heatmaps and plot-dendrograms.
Hans-Joachim Mucha
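The pairwise chi-square distances mentioned above, computed between the row profiles of a contingency table, can be sketched as follows (a minimal illustration; function and variable names are our own):

```python
def chi_square_distances(table):
    """Pairwise chi-square distances between row profiles of a contingency
    table: d(i, i')^2 = sum_j (p_ij - p_i'j)^2 / c_j, where p_ij is the
    profile (row normalized to sum 1) and c_j the column mass."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    profiles = [[v / sum(row) for v in row] for row in table]
    n = len(table)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for i2 in range(i + 1, n):
            s = sum((profiles[i][j] - profiles[i2][j]) ** 2 / col_mass[j]
                    for j in range(n_cols))
            d[i][i2] = d[i2][i] = s ** 0.5
    return d
```

The resulting distance matrix can then be fed directly into any pairwise clustering method, hierarchical or partitional.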
Classification, Clustering, and Visualisation Based on Dual Scaling
Abstract
In practice, the statistician is often faced with data that are already available, and these data are often mixed. The statistician must then try to draw optimal statistical conclusions with the most suitable methods. But are the variables scaled optimally? And what about missing data? Without loss of generality, we restrict ourselves here to binary classification/clustering. A very simple but general approach is outlined that is applicable to such data for both classification and clustering, based on data preparation (i.e., a down-grading step such as binning for each quantitative variable) followed by dual scaling (the up-grading step: scoring). As a byproduct, the quantitative scores can be used for multivariate visualisation of both data and classes/clusters. For illustrative purposes, a real data application to optical character recognition (OCR) is considered throughout the paper. Moreover, the proposed approach is compared with other multivariate methods such as the simple Bayesian classifier.
Hans-Joachim Mucha
Variable Selection in K-Means Clustering via Regularization
Abstract
In many cases, a data set contains both variables that are essential to the cluster structure and variables that are irrelevant to it. The K-means algorithm (MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, Berkeley, pp 281–297), one of the most popular clustering methods, can treat such a data set but cannot identify the variables essential for clustering. In supervised learning methods such as regression analysis, variable selection is a major topic; in clustering, however, it is currently not an active area of research. In this study, a new K-means clustering method is proposed to detect variables that are irrelevant to the cluster structure. The proposed method computes variable weights using an entropy regularization method (Miyamoto S, Mukaidono M (1997) Fuzzy c-means as a regularization and maximum entropy approach. In: Proceedings of the 7th International Fuzzy Systems Association World Congress, Prague, vol 2, pp 86–92) originally developed to obtain fuzzy memberships in fuzzy clustering. This method allows us to identify the variables important for clustering.
Yutaka Nishida
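Under entropy regularization, variable weights typically come out proportional to exp(-D_j/λ), where D_j is the within-cluster dispersion on variable j. The single weight-update step below is a plausible reading of this idea, not the author's exact formulation; all names are our own:

```python
import math

def entropy_variable_weights(data, labels, centers, lam=1.0):
    """One weight-update step for entropy-regularized variable weighting.

    Given current cluster assignments (labels) and centers, compute the
    within-cluster dispersion D_j of each variable j and set
    w_j = exp(-D_j / lam) / sum_k exp(-D_k / lam).
    Small lam concentrates weight on the most cluster-relevant variables;
    large lam flattens the weights toward uniform.
    """
    p = len(data[0])
    D = [0.0] * p
    for x, k in zip(data, labels):
        for j in range(p):
            D[j] += (x[j] - centers[k][j]) ** 2
    expo = [math.exp(-d / lam) for d in D]
    s = sum(expo)
    return [e / s for e in expo]
```

In a full algorithm this step would alternate with the usual K-means assignment and center updates, with distances computed using the current weights.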
Structural Representation of Categorical Data and Cluster Analysis Through Filters
Abstract
Representation of categorical data by nominal measurement leaves the entire information intact, which is not the case with widely used numerical or pseudo-numerical representation such as Likert-type scoring. This aspect is first explained, and then we turn our attention to the analysis of nominally represented data. For the analysis of a large number of variables, one typically resorts to dimension reduction, and its necessity is often greater with categorical data than with continuous data. In spite of this, Nishisato S, Clavel JG (Behaviormetrika 57:15–32, 2010) proposed an approach which is diametrically opposite to the dimension-reduction approach, for they advocate the use of doubled hyper-space to accommodate both row variables and column variables of two-way data in a common space. The rationale of doubled space can be used to vindicate the validity of the Carroll-Green-Schaffer scaling (Carroll JD, Green PE, Schaffer CM (1986) J Mark Res 23(3):271–280). The current paper will then introduce a simple procedure for the analysis of a hyper-dimensional configuration of data, called cluster analysis through filters. A numerical example will be presented to show a clear contrast between the dimension-reduction approach and the total information analysis by cluster analysis. There is no doubt that our approach is preferred to the dimension-reduction approach on two grounds: our results are a factual summary of a multidimensional data configuration, and our procedure is simple and practical.
Shizuhiko Nishisato
Three-Mode Hierarchical Subspace Clustering with Noise Variables and Occasions
Abstract
Three-mode data are observed in various domains such as panel research in psychology studies. To conceive clustering structures from three-mode data as an initial analysis, a clustering algorithm is applied to the data. However, traditional clustering algorithms cannot factor in the effects of occasions. In addition, it is difficult to understand these typically high-dimensional data. Although Vichi et al. (J Classif 24(1):71–98, 2007) proposed three-way clustering, their algorithms are based on complicated assumptions. We propose three-mode subspace clustering based on entropy weights. The proposed algorithm excludes complicated assumptions and provides results that can be easily interpreted.
Kensuke Tanioka, Hiroshi Yadohisa

Analysis of Data and Models

Frontmatter
Bayesian Methods for Conjoint Analysis-Based Predictions: Do We Still Need Latent Classes?
Abstract
Recently, more and more Bayesian methods have been proposed for modeling heterogeneous preference structures of consumers (see, e.g., Allenby et al., J Mark Res 32:152–162, 1995, 35:384–389, 1998; Baier and Polasek, Stud Classif Data Anal Knowl Organ 22:413–421, 2003; Otter et al., Int J Res Mark 21(3):285–297, 2004). Comparisons have shown that these new methods compete well with the traditional ones where latent classes are used for this purpose (see Ramaswamy and Cohen (2007) Latent class models for conjoint analysis. In: Gustafsson A, Herrmann A, Huber (eds) Conjoint measurement – methods and applications, 4th edn. Springer, Berlin, pp 295–320) for an overview on these traditional methods). This applies especially when the prediction of choices among products is the main objective (e.g. Moore et al., Mark Lett 9(2):195–207, 1998; Andrews et al., J Mark Res 39:479–487, 2002a; 39:87–98, 2002b; Moore, Int J Res Mark 21:299–312, 2004; Karniouchina et al., Eur J Oper Res 19(1):340–348, 2009, with comparative results). However, the question is still open whether this superiority still holds when the latent class approach is combined with the Bayesian one. This paper responds to this question. Bayesian methods with and without latent classes are used for modeling heterogeneous preference structures of consumers and for predicting choices among competing products. The results show a clear superiority of the combined approach over the purely Bayesian one. It seems that we still need latent classes for conjoint analysis-based predictions.
Daniel Baier
Non-additive Utility Functions: Choquet Integral Versus Weighted DNF Formulas
Abstract
In the context of conjoint analysis, a consumer’s purchase preferences can be modeled by means of a utility function that maps an attribute vector describing a product to a real number reflecting the preference for that product. Since simple additive utility functions are not able to capture interactions between different attributes, several types of non-additive functions have been proposed in recent years. In this paper, we compare two such model classes, namely the (discrete) Choquet integral and weighted DNF formulas as used in a logic-based query language called CQQL. Although both approaches have been developed independently of each other in different fields (decision analysis and information retrieval), they are actually quite similar and share several commonalities. By developing a conceptual link between the two approaches, we provide new insights that help to decide which of the two alternatives is to be preferred under what conditions.
Eyke Hüllermeier, Ingo Schmitt
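The discrete Choquet integral compared in this chapter can be computed from a capacity μ, a monotone set function with μ(∅) = 0 and μ(N) = 1; below is a minimal sketch in which the capacity is passed as a dict keyed by frozensets (an encoding of our own choosing):

```python
def choquet_integral(x, mu):
    """Discrete Choquet integral of x = (x_1, ..., x_n) w.r.t. capacity mu.

    C(x) = sum_i (x_(i) - x_(i-1)) * mu(A_(i)),
    where x_(1) <= ... <= x_(n) and A_(i) is the set of indices whose value
    is at least x_(i). For an additive mu it reduces to a weighted mean;
    non-additive capacities model attribute interaction.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # indices, ascending by value
    result, prev = 0.0, 0.0
    remaining = set(range(n))
    for i in order:
        result += (x[i] - prev) * mu[frozenset(remaining)]
        prev = x[i]
        remaining.discard(i)
    return result
```

Choosing μ({0}) + μ({1}) > μ({0, 1}), for instance, makes the two attributes substitutive, which no additive utility function can express.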
A Symmetry Test for One-Mode Three-Way Proximity Data
Abstract
Recently, several major advances in models of asymmetric proximity data analysis have occurred. These models usually do not deal with the relationships among three or more objects, but instead, those between two objects. However, there exist some approaches for analyzing one-mode three-way asymmetric proximity data that represent triadic relationships among three objects. Nonetheless, a method that evaluates the asymmetry of one-mode three-way asymmetric proximity data has not yet been proposed. There is no measure for judging the necessity of a symmetric model, reconstructed method, or asymmetric model analysis. The present study proposes a method that evaluates the asymmetry of one-mode three-way proximity data. In a square contingency table, a symmetry test is studied to check whether the data are symmetric. We propose a method that extends this symmetry test for square contingency tables to one-mode three-way proximity data.
Atsuho Nakayama, Hiroyuki Tsurumi, Akinori Okada
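The two-way building block that the paper extends is Bowker's symmetry test for a square contingency table; a minimal sketch follows (the three-way extension is the paper's contribution and is not reproduced here):

```python
def bowker_statistic(table):
    """Bowker's test statistic for symmetry of a square contingency table.

    X^2 = sum_{i<j} (n_ij - n_ji)^2 / (n_ij + n_ji),
    approximately chi-square distributed under the symmetry hypothesis,
    with one degree of freedom per informative off-diagonal pair.
    Pairs with n_ij + n_ji == 0 carry no information and are skipped.
    """
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            s = table[i][j] + table[j][i]
            if s > 0:
                stat += (table[i][j] - table[j][i]) ** 2 / s
                df += 1
    return stat, df
```

A large statistic relative to the chi-square quantile with df degrees of freedom indicates that a symmetric model is inadequate and an asymmetric analysis is warranted.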
Analysis of Conditional and Marginal Association in One-Mode Three-Way Proximity Data
Abstract
The purpose of this study was to examine the necessity for one-mode three-way multidimensional scaling analysis. In many cases, the results of the analysis of one-mode three-way multidimensional scaling are similar to those of one-mode two-way multidimensional scaling for lower dimensions, and, in fact, multidimensional scaling can be used for low dimensional analysis. Our results demonstrated that at lower dimensionality, triadic relationships represented by the results of one-mode three-way multidimensional scaling were almost consistent with the dyadic relationships derived from one-mode two-way multidimensional scaling. However, triadic relationships differ from dyadic relationships in analyses of higher dimensionality. The degree of coincidence obtained for one-mode three- and two-way multidimensional scaling revealed that triadic relationships can only be represented by one-mode three-way multidimensional scaling; specifically, triadic relationships based on conditional associations must be separately explained in terms of marginal associations for higher dimensionality analysis.
Atsuho Nakayama
Analysis of Asymmetric Relationships Among Soft Drink Brands
Abstract
Brand switching data among eight soft drink brands were analyzed. The data are represented by an 8 × 8 brand switching matrix. The brand switching matrix is inevitably asymmetric, because the relationship from brand j to brand k is not necessarily equal to the relationship from brand k to brand j. The brand switching matrix was analyzed by asymmetric multidimensional scaling based on singular value decomposition. The four-dimensional result was chosen as the solution. The solution gives the outward tendency, which represents the strength of switching from a corresponding brand to the other brands along each dimension, and the inward tendency, which represents the strength of switching to a corresponding brand from the other brands along each dimension. The solution disclosed that the differences between diet and non-diet brands as well as between cola and lemon-lime brands played important roles in the brand switching.
Akinori Okada
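A common first step when analyzing an asymmetric brand-switching matrix is to split it into symmetric and skew-symmetric parts, the latter carrying the net direction of switching; a minimal sketch (the chapter's SVD-based asymmetric MDS itself is not reproduced here):

```python
def sym_skew_split(M):
    """Split a square matrix M into M = S + A with
    S = (M + M^T) / 2 (symmetric: overall exchange volume) and
    A = (M - M^T) / 2 (skew-symmetric: net switching direction).
    The skew part A is what asymmetric scaling methods set out to represent.
    """
    n = len(M)
    S = [[(M[i][j] + M[j][i]) / 2 for j in range(n)] for i in range(n)]
    A = [[(M[i][j] - M[j][i]) / 2 for j in range(n)] for i in range(n)]
    return S, A
```

A positive entry A[j][k] means brand j loses more customers to brand k than it gains, which is the raw material for the outward and inward tendencies described above.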
Automatic Regularization of Factorization Models
Abstract
Many recent machine learning approaches for prediction problems over categorical variables are based on factorization models, e.g. matrix or tensor factorization. Due to the large number of model parameters, factorization models are prone to overfitting, and typically Gaussian priors are applied for regularization. Finding proper values for the regularization parameters is usually done with an expensive grid search using holdout validation data. In this work, two approaches are presented in which regularization values are found without increasing computational complexity. The first is based on interweaving the optimization of model parameters and regularization values in stochastic gradient descent algorithms. Secondly, a two-level Bayesian model that integrates the regularization values into inference is briefly discussed.
Steffen Rendle
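For reference, the fixed-λ baseline that the proposed approaches aim to improve upon — matrix factorization trained by stochastic gradient descent with an L2 (Gaussian-prior) penalty — can be sketched as follows; hyperparameter values and names are illustrative:

```python
import random

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.1,
           epochs=200, seed=0):
    """Fit r_ui ~ p_u . q_i by SGD with a *fixed* L2 regularization `reg`.

    ratings: list of (user, item, value) triples. The point of the chapter
    is precisely that choosing `reg` normally requires an expensive grid
    search; the proposed methods adapt it during training instead.
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # gradient + L2 shrinkage
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q
```

Interweaving regularization tuning would add an update of `reg` itself between parameter steps, evaluated against a small validation split, so no separate grid search is needed.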
Three-Way Data Analysis for Multivariate Spatial Time Series
Abstract
We discuss several methods to realize three-way (three-mode) approaches to clustering using the INDCLUS model and to multidimensional scaling using the INDSCAL model, which assumes that the objects are embedded in a discrete or continuous space common to all data, with individual differences captured by weighting each dimension. We apply some effective dynamic graphical approaches using two methods to perform a time-space structural analysis for multivariate spatial time series. The clustering and scaling of multivariate spatial time series consider: (1) the spatial nature of the objects to be clustered geometrically (discrete); (2) the characteristics of the feature space with the time series (continuous); (3) the latent structure between space and time. The last aspect is addressed using dynamic graphics with a matrix-type presentation. We can simultaneously observe the spatial structure, navigate the feature space, and zoom in and out of the results at a suitable scale. The proposed analysis can be applied to the classification and scaling of the prefectures of Japan on the basis of the observed dynamics of some safety indicators.
Mitsuhiro Tsuji, Hiroshi Kageyama, Toshio Shimokawa

Applications

Frontmatter
Assessment of the Relationship Between Native Thoracic Aortic Curvature and Endoleak Formation After TEVAR Based on Linear Discriminant Analysis
Abstract
In the field of surgical treatment, thoracic endovascular aortic repair has recently gained popularity, but this treatment often causes an adverse clinical side effect called endoleak. Risk prediction of endoleak is essential for pre-operative planning (Nakatamari et al., J Vasc Interv Radiol 22(7):974–979, 2011). In this study, we focus on a quantitative curvature measure of the morphology of a patient's aorta and predict the risk of endoleak formation through linear discriminant analysis. Here, we objectively evaluate the relationship between the side effect after stent-graft treatment for thoracic aneurysm and a patient's native thoracic aortic curvature. In addition, based on the sample influence function for the average of discriminant scores in linear discriminant analysis, we also perform statistical diagnostics on the result of the analysis. We detected influential training samples whose deletion improves prediction accuracy and formed subsets from all of their possible combinations. Furthermore, by considering the minimum misclassification rate based on leave-one-out cross-validation as in Hastie et al. (The elements of statistical learning. Springer, New York, 2001, pp 214–216) and the minimum number of training samples to be deleted, we deduced the subset to be excluded from the training data when developing the target classifier. From this study, we detected an important part of the native thoracic aorta in terms of risk prediction of endoleak occurrence, and identified patients influential for the result of the discrimination.
Kuniyoshi Hayashi, Fumio Ishioka, Bhargav Raman, Daniel Y. Sze, Hiroshi Suito, Takuya Ueda, Koji Kurihara
Fold Change Classifiers for the Analysis of Gene Expression Profiles
Abstract
The classification of gene expression data is often based on profiles containing thousands of features. These features represent the abundance of RNA molecules related to a particular gene. Most state-of-the-art algorithms in this field, like random forests or boosting ensembles, can be seen as combination strategies for single-threshold classifiers. The structure of these classifiers is beneficial in such high-dimensional settings, as it permits feature reduction and allows for a direct semantic and syntactic interpretation. A single ray, the half-open interval representing one class, compares a single expression value of the profile with a threshold. In this work an alternative base classifier, the fold change classifier, is discussed. This classifier compares two expression values of the same sample. We analyze fold change classifiers as unweighted ensembles with majority or unanimity vote. A sample compression bound for unweighted ensembles of fold change classifiers is also given.
Ludwig Lausser, Hans A. Kestler
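A fold change base classifier votes on the ratio of two expression values of the same sample; an unweighted majority-vote ensemble over such rules can be sketched as follows (the rule triples (i, j, t) are a hypothetical encoding of our own, not the authors' notation):

```python
def fold_change_classify(profile, rules):
    """Unweighted majority vote over fold change base classifiers.

    profile: one expression profile (positive values assumed, since each
    rule divides one expression value by another).
    rules: list of (i, j, t); a rule votes class 1 iff
    profile[i] / profile[j] >= t, i.e. it thresholds the fold change
    between two genes instead of a single expression value -- which makes
    it invariant to per-sample rescaling of the whole profile.
    """
    votes = sum(1 for i, j, t in rules if profile[i] / profile[j] >= t)
    return 1 if votes * 2 > len(rules) else 0
```

A unanimity-vote variant would instead require `votes == len(rules)`, trading sensitivity for specificity.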
An Automatic Extraction of Academia-Industry Collaborative Research and Development Documents on the Web
Abstract
This research focuses on a method for automatically extracting Japanese documents describing University-Industry (U-I) relations from the Web. The method proposed here consists of a preprocessing step for Japanese texts and a classification step with an SVM. The feature selection process is specifically tuned for U-I relations documents. A U-I document extraction experiment was conducted, and the features found to be relevant for this task are discussed.
Kei Kurakawa, Yuan Sun, Nagayoshi Yamashita, Yasumasa Baba
The Utility of Smallest Space Analysis for the Cross-National Survey Data Analysis: The Structure of Religiosity
Abstract
The purpose of this paper is to illustrate the utility of Smallest Space Analysis (SSA) developed by Louis Guttman using the examples of National Religion Surveys conducted in Japan (2007) and Germany (2008) by the research team organized by the author. By conducting a data analysis of these two surveys, we try to examine the similarities and differences in religiosity between people in Japan and Germany. Smallest Space Analysis (SSA) demonstrates its usefulness in exploring the characteristics of religiosity in different countries from a comparative perspective.
Kazufumi Manabe
Socio-economic and Gender Differences in Voluntary Participation in Japan
Abstract
The aim of the present paper is to examine the relationship among participation in voluntary associations, socio-economic position, and gender. Based upon a nationally representative sample in Japan in 2005 (N = 2,827), we classify a variety of voluntary organizations in terms of membership through which social integration/cohesion and social exclusion operate among different groups – such as social class, education, age, and gender. Correspondence analysis of voluntary participation data revealed that membership is differentiated by gender. It was confirmed that the distinction between old and new organizations seems to be valid in the Japanese context. The implications of the results are also discussed in terms of gender inequality and segregation not only in the economic arena but also in society as a whole.
Miki Nakai
Estimating True Ratings from Online Consumer Reviews
Abstract
Online consumer reviews have emerged in the last decade as a promising starting point for monitoring and analyzing individual opinions about products and services. Especially the corresponding “star” ratings are frequently used by marketing researchers to address various aspects of electronic word-of-mouth (eWOM). But there also exist several studies which raise doubts about the general reliability of posted ratings. Against this background, we introduce a new framework based on the Beta Binomial True Intentions Model suggested by Morrison (J Mark Res 43(2):65–74, 1979) to accommodate the possible uncertainty inherent in the ratings contained in online consumer reviews. We show that, under certain conditions, the suggested framework is suitable to estimate “true” ratings from posted ones which proves advantageous in the case of rating-based predictions, e.g. with respect to the willingness to recommend a product or service. The theoretical considerations are illustrated by means of synthetic and real data.
Diana Schindler, Lars Lüpke, Reinhold Decker
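The beta-binomial distribution underlying the True Intentions Model — binomial counts whose success probability is itself Beta(a, b) distributed — has the probability mass function sketched below (a generic textbook form, not the authors' full estimation framework):

```python
from math import lgamma, exp, comb

def beta_binomial_pmf(k, n, a, b):
    """P(K = k) for the beta-binomial distribution.

    K ~ Binomial(n, p) with p ~ Beta(a, b), so
    P(K = k) = C(n, k) * B(k + a, n - k + b) / B(a, b).
    Computed via log-gamma for numerical stability. In a true-intentions
    setting, a posted star rating is a noisy draw whose latent mean
    reflects the 'true' rating to be estimated.
    """
    def log_beta(x, y):
        return lgamma(x) + lgamma(y) - lgamma(x + y)

    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))
```

With a = b = 1 (a uniform prior on p) the distribution is uniform over 0..n; skewed a, b encode systematic over- or under-rating.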
Statistical Process Modelling for Machining of Inhomogeneous Mineral Subsoil
Abstract
Because tool wear and production time are very cost-sensitive factors in the machining of concrete, adapting the tools to the particular machining process is of major importance. We show how statistical methods can be used to model the influences of the process parameters on the forces affecting the workpiece as well as on the chip removal rate and the wear rate of the diamond used. Based on these models, a geometrical simulation model can be derived which will help to determine optimal parameter settings for specific situations. As the machined materials are in general abrasive, the usual discretized simulation methods such as finite element models cannot be applied. Hence our approach uses another type of discretization, subdividing both material and diamond grain into Delaunay tessellations and interpreting the resulting micropart connections as predetermined breaking points. The process is then simulated iteratively, and in each iteration the entities of interest are computed.
Claus Weihs, Nils Raabe, Manuel Ferreira, Christian Rautert
Backmatter
Metadata
Title
German-Japanese Interchange of Data Analysis Results
edited by
Wolfgang Gaul
Andreas Geyer-Schulz
Yasumasa Baba
Akinori Okada
Copyright Year
2014
Electronic ISBN
978-3-319-01264-3
Print ISBN
978-3-319-01263-6
DOI
https://doi.org/10.1007/978-3-319-01264-3
