2021 | Book

Data Analysis and Rationality in a Complex World

Edited by: Theodore Chadjipadelis, Berthold Lausen, Angelos Markos, Tae Rim Lee, Angela Montanari, Rebecca Nugent

Publisher: Springer International Publishing

Book series: Studies in Classification, Data Analysis, and Knowledge Organization

About this book

This volume presents the latest advances in statistics and data science, including theoretical, methodological and computational developments and practical applications related to classification and clustering, data gathering, exploratory and multivariate data analysis, statistical modeling, and knowledge discovery and seeking. It includes contributions on analyzing and interpreting large, complex and aggregated datasets, and highlights numerous applications in economics, finance, computer science, political science and education. It gathers a selection of peer-reviewed contributions presented at the 16th Conference of the International Federation of Classification Societies (IFCS 2019), which was organized by the Greek Society of Data Analysis and held in Thessaloniki, Greece, on August 26-29, 2019.

Table of contents

Frontmatter
PerioClust: A Simple Hierarchical Agglomerative Clustering Approach Including Constraints

PerioClust is a hierarchical agglomerative clustering (HAC) method including temporal (resp. spatial) ordering constraints. This new semi-supervised learning algorithm is designed to consider two potentially error-prone sources of information associated with the same observations. One reflects dissimilarities in the “feature space” and the other the temporal (resp. spatial) constraint structure between the observations. A distance-based approach is adopted to modify the distance measure in the classical HAC algorithm using a convex combination to take into account the two initial dissimilarity matrices. The choice of the mixing parameter is, therefore, the key point. We define a criterion based on cophenetic distances, as well as a resampling procedure to ensure the good robustness of the proposed clustering method. The dendrogram associated with this HAC can be interpreted as the result of a compromise between each source of information analysed separately. We illustrate our clustering method on two real data sets: (i) an archaeological one containing temporal information, (ii) a socio-economical one containing geographical information.

Lise Bellanger, Arthur Coulon, Philippe Husi
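
The convex-combination idea in the PerioClust abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the max-scaling of each matrix, the naive complete-linkage routine, the names `mix` and `hac_complete`, the convention that `alpha` weights the feature matrix, and the toy data are all assumptions made for illustration. The cophenetic criterion for choosing the mixing parameter is not reproduced here.

```python
def normalise(d):
    """Scale a square dissimilarity matrix so its largest entry is 1 (assumption)."""
    top = max(max(row) for row in d)
    return [[v / top for v in row] for row in d]


def mix(d_feature, d_constraint, alpha):
    """Convex combination alpha * D_feature + (1 - alpha) * D_constraint."""
    a, b = normalise(d_feature), normalise(d_constraint)
    n = len(a)
    return [[alpha * a[i][j] + (1 - alpha) * b[i][j] for j in range(n)]
            for i in range(n)]


def hac_complete(d, n_clusters):
    """Naive complete-linkage agglomeration down to n_clusters groups."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: largest pairwise dissimilarity between groups.
                link = max(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Toy data: four observations dated 0, 1, 5, 6 (hypothetical values).
times = [0, 1, 5, 6]
d_time = [[abs(s - t) for t in times] for s in times]
# Hypothetical feature dissimilarities that pair 0 with 3 and 1 with 2 instead.
d_feat = [[0, 10, 10, 1],
          [10, 0, 2, 10],
          [10, 2, 0, 10],
          [1, 10, 10, 0]]

by_time = hac_complete(mix(d_feat, d_time, alpha=0.0), 2)     # constraint only
by_feature = hac_complete(mix(d_feat, d_time, alpha=1.0), 2)  # features only
```

With `alpha=0.0` the partition follows the temporal ordering, with `alpha=1.0` it follows the feature space, and intermediate values give a compromise between the two sources.
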
What Was Really the Case? Party Competition in Europe at the Occasion of the 2019 European Parliament Elections

The main aim of the paper is to analyse political competition in EU member states on the occasion of the 2019 European Parliament elections. At the core of our analysis are both the priorities of the national parties campaigning for the 2019 European elections and the manifestos of the transnational party groups, each consisting of national member parties from the 28 member states of the European Union. By comparing the major priorities of national actors/parties and those of the European political groups, we will be able to gauge whether they share the same or different dimensions of policy. More broadly, we will depict whether the dynamism in policy competition at the national level affects EP political groups or vice versa. The analysis is implemented through the use of correspondence analysis. Through this approach, the axes of political competition are identified.

Theodore Chadjipadelis, Eftichia Teperoglou
A Fast Electric Vehicle Planner Using Clustering

Over the past few years, several studies have considered the problem of Electric Vehicle Path Planning with intermediate recharge (EVPP-R) that consists of finding the shortest path between two given points by traveling through one or many charging stations, without exceeding the vehicle’s range. Unfortunately, the exact solution to this problem has a high computational cost. Therefore, speedup techniques are generally necessary (e.g., contraction hierarchies). In this paper, we propose and evaluate a new fast and intuitive graph clustering technique, which is applied on a real map with charging station data. We show that by grouping nearby stations, we can reduce the number of stations considered by a factor of 13 and increase the speed of computation by a factor of 35, while having a very limited trade-off increase, of less than 1%, on the average journey duration time.

Jaël Champagne Gareau, Éric Beaudry, Vladimir Makarenkov
A Generalized Coefficient of Determination for Mixtures of Regressions

One of the challenges in cluster analysis is the evaluation of the obtained clustering results without using auxiliary information. To this end, a common approach is to use internal validity criteria. For mixtures of linear regressions whose parameters are estimated via the maximum likelihood approach, we propose a three-term decomposition of the total sum of squares as a starting point to define some internal validity criteria. Exploiting this decomposition, local and overall coefficients of determination are, respectively, defined to judge how well the model fits the data group-by-group but also taken as a whole. An application to real data illustrates the use and the usefulness of these proposals.

Roberto Di Mari, Salvatore Ingrassia, Antonio Punzo
Distance Measurement When Fuzzy Numbers Are Used. Survey of Selected Problems and Procedures

The goal is to identify and to discuss distance and dissimilarity measures calculated with fuzzy numbers. It is crucial to define the distance and dissimilarity measures for unconventional fuzzy numbers, i.e. asymmetric, overlapping triangular, with unequal width. Resulting distance measures are to be used for clustering and linear ordering of objects. The method applied consists of an attempt to identify and to discuss the applicability of specialised techniques for unconventional fuzzy measurement. The emphasis is put on distance (similarity and dissimilarity) of measurement concepts when unconventional fuzzy numbers are used. The use of conventional fuzzy numbers, i.e. symmetric, not overlapping triangular, with equal width is limited when Computer-Aided Web Interviewing is applied. Respondents tend to use asymmetric fuzzy numbers with overlapping shape and unequal width. Several problems arise in the multivariate statistical analysis of measurement results. Proposals from pattern recognition literature are not applicable and new methods based on directed fuzzy numbers are involved.

Józef Dziechciarz, Marta Dziechciarz-Duda
Performance Measures in Discrete Supervised Classification

The evaluation of results in Cluster Analysis frequently appears in the literature, and a variety of evaluation measures have been proposed. By contrast, in supervised classification, particularly in the discrete case, the evaluation of results has received relatively little attention in the literature. This is the motivation underlying this study. The evaluation of the performance of any model of supervised classification is, generally, based on the number of cases correctly or incorrectly predicted by the model. However, these measures can lead to a misleading evaluation when the data is not balanced. More recently, other types of measures have been studied, such as association or agreement coefficients, the Huberty index, Mutual information, and even ROC curves. Exploratory studies were conducted in this study to understand the relationship between each measure and data characteristics, namely, sample size, balance, and separability of classes. To this end, simulated data and a Beta regression model of the performance of the models were used.

Ana Sousa Ferreira, Anabela Marques
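
The remark that count-based measures can mislead on unbalanced data can be made concrete with a small sketch. It compares plain accuracy with Cohen's kappa, used here as one example of an agreement coefficient. The choice of kappa, the function names, and the toy labels are assumptions for illustration, not the measures or data studied in the chapter.

```python
def accuracy(y_true, y_pred):
    """Share of cases whose predicted class equals the true class."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between truth and prediction, corrected for chance agreement."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_observed = accuracy(y_true, y_pred)
    # Chance agreement: product of the marginal class proportions, summed over classes.
    p_chance = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one class occurs")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical unbalanced sample: 95 cases of class 0 and 5 cases of class 1.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

Here the trivial classifier reaches an accuracy of 0.95 while its kappa is 0, which shows how a chance-corrected coefficient exposes what the raw hit rate hides.
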
Using EVT to Assess Risk on Energy Market

The aim of this paper is to describe and measure the risk of price changes in the energy market. The risk is estimated with Conditional Value-at-Risk (CVaR) and Median Shortfall (MS) based on some types of Value-at-Risk measures: VaR, stress VaR, Incremental Risk Charge (IRC) estimated using Extreme Value Theory (EVT). These measures are calculated for time series of daily and hourly rates of return of electric energy prices from the European Energy Exchange (EEX) spot market. Based on time series from 1st January 2002 to 31st December 2016, we attempt to answer the question: which measure is the most appropriate for risk estimation on the energy market.

Alicja Ganczarek-Gamrot, Dominik Krężołek, Grażyna Trzpiot
Measuring and Testing Mutual Dependence for Functional Data

In this paper, measures of mutual independence of many-vector random processes were defined. Based on these measures, permutation tests of mutual independence of these random processes were also given. The properties of the described methods were presented using simulation studies for univariate and multivariate processes.

Tomasz Górecki, Mirosław Krzyśko, Waldemar Wołyński
Single Imputation Via Chunk-Wise PCA

The straightforward application of Principal Component Analysis (PCA) to incomplete data sets is not possible and practitioners often remove or ignore observations that contain at least one missing value. Three main strategies can be distinguished to apply PCA on a data set with missing entries: (i) imputation of the missing values prior to the application of PCA; (ii) obtain the PCA solution and ignore the missing values; and (iii) obtain the PCA solution and deal explicitly with the missing values. Methods implementing the latter strategy have been reviewed and, among them, the iterative PCA (iPCA) approach has been shown to be preferable. This paper proposes a chunk-wise implementation of iPCA, suitable for tall data sets, that is, with many observations. In the proposed approach, each data chunk is imputed according to the data analyzed so far. The proposed procedure is compared to the batch iPCA and to a naive implementation, which imputes each data chunk independently. In a series of experiments, we consider different data sets and missing data mechanisms.

Alfonso Iodice D’Enza, Francesco Palumbo, Angelos Markos
Clustering Mixed-Type Data: A Benchmark Study on KAMILA and K-Prototypes

Benchmarking in cluster analysis is the process of analyzing which clustering techniques give the best result for different types of data structures as well as setting a standard for evaluation of newer clustering methods. There are many instances of benchmarking in cluster analysis for continuous data, but only a few for mixed-type data, i.e. data sets with nominal and continuous variables. Therefore, we explore the process for benchmarking various clustering methods on simulated mixed-type data sets with varying proportions of continuous and nominal variables. For this purpose, we test a newer clustering algorithm, KAMILA, against K-prototypes and tandem analysis where data are preprocessed using multiple correspondence analysis and then clustered using K-means, fuzzy K-means, probabilistic distance clustering (PD), and Student-t mixture models.

Jarrett Jimeno, Madhumita Roy, Cristina Tortora
Exploring Social Attitudes Toward the Green Infrastructure Plan of the Drama City in Greece

A complex Green Infrastructure (GI) plan has recently been put into operation in the city of Drama, located in the northeastern part of Greece, aiming at upgrading the environmental, bioclimatic, and economic conditions of the downtown area. Within the project, a governance network has been established to promote active social participation and increase the project’s social acceptability. This work presents the preliminary results of the first of a series of social surveys carried out within the governance network’s function to explore the attitudes of the entrepreneurs of the area, who are expected to be heavily affected by the GI plan. A total of 117 responses were collected using a questionnaire, and joint dimension reduction and clustering of the data was conducted to identify the main factors comprising the entrepreneurs’ attitude patterns toward the GI plan. These factors involve the perceived negative impacts during the project implementation phase, the potential usefulness of the GI infrastructure, and the perceived benefits after the project completion. Three groups of entrepreneurs were identified in terms of their attitudes toward the GI plan: (a) negative to change, (b) utilitarians, and (c) positive to change. Each group was profiled according to its sociodemographic characteristics.

Vassiliki Kazana, Angelos Kazaklis, Dimitrios Raptis, Efthimia Chrisanthidou, Stella Kazakli, Nefeli Zagourgini
Spatial Perception for Structured and Unstructured Data in Topological Data Analysis

Recent years have witnessed the accumulation of vast amounts of data and information. It is difficult to capture the characteristics of these data spatially or visualize them robustly and stably with respect to data updates and increases using conventional methods. The purpose of this study is to systematically visualize the relationships among drugs using diverse information. While studies have conducted visualization research using structured data, such as chemical descriptors, research has not yet been performed from comprehensive viewpoints using unstructured data on efficacy, adverse events, and other phenomena. Therefore, we use a topological data analysis mapper and a spatial perception method to obtain and visualize data based on the integrated principal component score of quantitative and qualitative data. Consequently, a network composed of characteristic clusters according to drug class was shown. Findings show that heterogeneous compounds in the cluster may indicate the potential for drug repositioning. Our proposed method is an effective means of obtaining new knowledge of pharmaceuticals.

Yoshitake Kitanishi, Fumio Ishioka, Masaya Iizuka, Koji Kurihara
Text, Content and Data Analysis of Journal Articles: The Field of International Relations

Term frequency is the basic, if not the main, measure that emerges when mapping latent information in a text corpus. This paper addresses the issue of exploring a set of textual documents based on their metadata and term frequencies, by introducing the mixed use of text mining and data analysis methods for analyzing social science journal articles. In particular, this survey links the quantitative research of scientific discourse (through specific tools of data analysis) with research on the development of a scientific field, namely International Relations. Preliminary results on field-related published journal articles demonstrate the effectiveness of the proposed methodology.

Nikos Koutsoupias, Kyriakos Mikelis
Quantile Measures of Extreme Risk on Metals Market

During the period of dynamic economic development, certain events that may cause disruptions at the level of various economic processes are observed. These disruptions usually bring significant consequences. There are many methods for identifying rare events. Some of them include the so-called extreme statistics. From a statistical point of view, rare events are associated with high order quantiles for probability distributions that allow for determining the level of risk for which the probability of occurrence of a risky event is extremely low. The paper focuses on the possibility of using the Hill estimator and its modifications to assess the risk of rare events. Results of the analysis for selected theoretical models are compared in the paper. The empirical analysis was conducted on the example of assets from the precious metals market, i.e. gold and silver.

Dominik Krężołek, Grażyna Trzpiot
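
For readers unfamiliar with the Hill estimator mentioned in the abstract, the following sketch shows its textbook form together with a Weissman-type quantile extrapolation. It does not reproduce the modifications compared in the chapter. The function names and the toy sample are assumptions, and in practice the input would be a series of positive losses such as negative returns.

```python
import math


def hill_estimator(sample, k):
    """Hill estimate of the extreme value index from the k largest values."""
    x = sorted(sample, reverse=True)
    if not 1 <= k < len(x):
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    # Mean log-excess of the k largest values over the threshold.
    return sum(math.log(v / threshold) for v in x[:k]) / k


def weissman_quantile(sample, k, p):
    """Extrapolated level exceeded with (small) probability p."""
    x = sorted(sample, reverse=True)
    gamma = hill_estimator(sample, k)
    return x[k] * (k / (len(x) * p)) ** gamma


# Hypothetical sample chosen so the result can be checked by hand.
losses = [math.exp(3), math.exp(2), math.exp(1), 1.0]
gamma_hat = hill_estimator(losses, k=2)         # (2 + 1) / 2 = 1.5
q_hat = weissman_quantile(losses, k=2, p=0.25)  # e * 2 ** 1.5
```

The estimate depends strongly on the number of upper order statistics `k`, so in applied work it is usually inspected over a range of `k` values rather than computed once.
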
Evaluation of Text Clustering Methods and Their Dataspace Embeddings: An Exploration

Fair evaluation of text clustering methods needs to clarify the relations between (1) preprocessing, resulting in raw term occurrence vectors, (2) data transformation, and (3) method in the strict sense. We have tried to empirically compare a dozen well-known methods and variants in a protocol crossing three contrasting open-access corpora with a few dozen dataspaces built from different metrics and/or matrix decompositions. We compared the resulting clusterings to their supposed “ground-truth” classes by means of four usual indices. The results show both a confirmation of well-established implicit combinations and good performances of unexpected ones, mostly in spectral or kernel dataspaces. The rich material resulting from these roughly 600 runs includes a wealth of intriguing facts, which call for further research on the specificities of text corpora in relation to methods and dataspaces.

Alain Lelu, Martine Cadot
Specification of Basis Spacing for Process Convolution Gaussian Process Models

Gaussian process (GP) models have been widely used for statistical modeling of point-referenced data in many scientific applications, including regression, classification, and clustering problems. Standard specification of GP models is computationally inefficient for applications with a large sample size. One solution is to construct the GP by convolving a smoothing kernel with a discretized white noise process, which requires choosing the number of bases. The distance between adjacent bases plays a key role in model accuracy. In this paper, we perform a series of simulations to find a general rule for the basis spacing required for an accurate representation of a discrete process convolution GP model. Under certain common conditions, we find that a basis spacing of one-quarter of the practical range of the process works well.

Waley W. J. Liang, Herbert K. H. Lee
Estimation of Classification Rules From Partially Classified Data

We consider the situation where the observed sample contains some observations whose class of origin is known (that is, they are classified with respect to the g underlying classes of interest), and where the remaining observations in the sample are unclassified (that is, their class labels are unknown). For class-conditional distributions taken to be known up to a vector of unknown parameters, the aim is to estimate the Bayes’ rule of allocation for the allocation of subsequent unclassified observations. Estimation on the basis of both the classified and unclassified data can be undertaken in a straightforward manner by fitting a g-component mixture model by maximum likelihood (ML) via the EM algorithm in the situation where the observed data can be assumed to be an observed random sample from the adopted mixture distribution. This assumption applies if the missing-data mechanism is ignorable in the terminology pioneered by Rubin (1976). An initial likelihood approach was to use the so-called classification ML approach whereby the missing labels are taken to be parameters to be estimated along with the parameters of the class-conditional distributions. However, as it can lead to inconsistent estimates, the focus of attention switched to the mixture ML approach after the appearance of the EM algorithm (Dempster et al. 1977). Particular attention is given here to the asymptotic relative efficiency (ARE) of the Bayes’ rule estimated from a partially classified sample. Lastly, we consider briefly some recent results in situations where the missing label pattern is non-ignorable for the purposes of ML estimation for the mixture model.

Geoffrey McLachlan, Daniel Ahfock
Correspondence Analysis and Kriging: Projection of Quantitative Information on the Factorial Maps

In this study, a methodological scheme is proposed for the combined use of Analyse Factorielle des Correspondances (AFC, or Correspondence Analysis) and the Ordinary Kriging method to display values of quantitative variables as supplementary points (or as “supplementary data”) onto the factorial maps (or planes) resulting from the application of AFC to a contingency table of two categorical variables. The proposed method can also be generalized in the case of Multiple Correspondence Analysis (Analyse des Correspondances Multiples). The kriging method is widely used as one of the most effective spatial interpolation techniques. The proposed methodological scheme is demonstrated using hypothetical data from a 5 × 4 contingency table (sites × crops). Fertilizer mean costs will be used as supplementary points or “supplementary data”. Also, a specific data coding scheme is proposed aiming at a better presentation and interpretation of the graphical results.

George Menexes, Thomas Koutsos
Intertemporal Exploratory Analysis of E-Commerce From Greek Households from Official Statistics Data

E-commerce worldwide is transforming the economy at the macro level while also affecting the consumption habits of households. This paper aims to map the effects of these changes through exploratory data analysis. For this purpose, data from Official Statistics were analyzed, as collected via sample surveys of the Greek Statistical Authority (ELSTAT) from 2009 to 2018. This work is of interest in several respects: not only is the phenomenon under study an area of general scientific interest, but the period of data collection also includes the beginning of the economic crisis in Greece.

Stratos Moschidis, Athanasios Thanopoulos
Benchmarking in Cluster Analysis: A Study on Spectral Clustering, DBSCAN, and K-Means

We perform a benchmarking study to identify the advantages and the drawbacks of Spectral Clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). We compare the two methods with the classic K-means clustering. The methods are performed on five simulated and three real data sets. The obtained clustering results are compared using external and internal indices, as well as run times. Although there is not one method that performs best on all types of data sets, we find that DBSCAN should generally be reserved for non-convex data with well-separated clusters or for data with many outliers. Spectral Clustering has better overall performance but with higher instability of the results compared to K-means, and longer run time.

Nivedha Murugesan, Irene Cho, Cristina Tortora
Detection of Topics and Time Series Variation in Consumer Web Communication Data

Consumers’ personal interests are influenced by new product strategies, such as marketing communication schemes, and these can change over time. Thus, it is important to consider temporal variation in trending consumer interests. We aimed to detect temporal variations in consumer web communication data using weight coefficients between entries and topics obtained from nonnegative matrix factorization. The weight coefficient, which indicates the strength between an entry and a topic, was modeled with a Bayesian network to capture changes in the topic over time. Bayesian networks, commonly used in a wide range of studies such as anomaly detection, reasoning, and time series prediction, build models from data using Bayesian inference for probability computations. The causations can be modeled by representing conditional dependence based on the edges in a directed graph of the Bayesian network.

Atsuho Nakayama
Classification Through Graphical Models: Evidences From the EU-SILC Data

The purpose of this work is to evaluate the level of perceived health by studying possible factors such as personal information, economic status, and use of free time. The analysis is carried out on the European Union Statistics on Income and Living Conditions (EU-SILC) survey covering 31 European countries. To this end, we take advantage of graphical models, which are suitable tools to represent complex dependence structures among a set of variables. In particular, we consider a special case of Chain Graph model, known as Chain Graph models of type IV for categorical variables. We implement a Bayesian learning procedure to discover the graph which best represents the dataset. Finally, we apply a classification algorithm based on classification trees to identify clusters.

Federica Nicolussi, Agnese Maria Di Brisco, Manuela Cazzaro
A Simulation Study for the Identification of Missing Data Mechanisms Using Visualisation

Understanding the cause of the missingness in data is a science of its own and is of great importance for the application of valid and unbiased analysis techniques for missing data. The distribution of missingness is defined by certain dependencies on either observed or missing values in a data set, and therefore, requires a multivariate visualisation to attempt to identify the missing data mechanism (MDM). Multivariate categorical data sets containing missing data entries can be separated into observed and unobserved (or missing) subsets by creating an additional category level (CL) for each variable with missing responses in the indicator matrix. Subset multiple correspondence analysis (sMCA) can then be applied to the recoded indicator matrix to obtain separate biplots for the observed and missing subsets. The sMCA biplot of missing categories enables the exploration of the missing values which could expose non-response patterns. Partitioning around medoids (pam) clustering is used to determine whether sufficient clustering structures can be identified in the sMCA biplot of missing responses. In a simulation study, data sets with different sample sizes are generated from three distributions. Artificial missingness is created by deleting values according to MAR and MCAR MDMs with different percentages of missing values. The influence of the underlying distribution on the outcome of the clustering techniques will be presented. The insight obtained from the simulation results provides guidelines for the identification of the MDM in real data applications.

Johané Nienkemper-Swanepoel, Niël Le Roux, Sugnet Gardner-Lubbe
Triplet Clustering of One-Mode Two-Way Proximities

Some researchers have noticed that proximities among three objects are useful for disclosing relationships among objects. However, one-mode three-way proximities are sometimes harder to obtain than one-mode two-way proximities. Hence, a procedure to assemble one-mode three-way proximities from one-mode two-way proximities is introduced, together with a method for hierarchical clustering of the resulting one-mode three-way proximities, where three clusters (objects) form a new cluster at each step of the clustering. The procedure is applied to one-mode two-way dissimilarities among kinship terms, and the resulting one-mode three-way dissimilarities (dissimilarities of three kinship terms) were analyzed by the method of cluster analysis for one-mode three-way dissimilarities, which is comparable to the complete linkage. The one-mode two-way dissimilarities, from which the one-mode three-way dissimilarities were assembled, were analyzed by the complete linkage cluster analysis. The comparison of the two results shows that the present analysis revealed aspects which cannot be disclosed by one-mode two-way cluster analysis.

Akinori Okada, Satoru Yokoyama
First-Time Voters in Greece: Views and Attitudes of Youth on Europe and Democracy

This study investigates the views, attitudes, and values of young people in Greece using a multivariate data analysis workflow. The primary objective is to investigate political mobilization and its association with various political characteristics, perceptions of democracy, and personal moral values. The main research output is a map representing the political behavior of young first-time voters. A secondary objective is to identify the most important factors which determine their vote. The study results contribute to the voting behavior literature by revealing contemporary young voters’ typologies, visualizing political competition and dynamics of political behavior, and highlighting emerging important factors which affect voting choice.

Georgia Panagiotidou, Theodore Chadjipadelis
Comparison of Hierarchical Clustering Methods for Binary Data From SSR and ISSR Molecular Markers

Data from molecular markers, which are used to construct dendrograms based on genetic distances between different plant species, are encoded as binary data. For the construction of the dendrograms, the most commonly used linkage method is the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) in combination with the squared Euclidean distance. It seems that in this scientific field this is the “golden standard” clustering method, in the sense that this methodological scheme is used in the vast majority of the corresponding studies. In this study, a comparison of 189 clustering methods (excluding the “golden standard”), that is, seven linkage methods combined with 27 appropriate distances, along with Benzécri’s chi-squared distance in combination with Ward’s linkage method, is attempted using data originating from molecular markers applied to pear tree species and Sinapis arvensis populations. Fruit tree cluster analysis was performed using SSR markers, while for the clustering of Sinapis arvensis populations, ISSR markers were used. The results showed that the “golden standard” is not the only appropriate method for dendrogram construction based on binary data derived from molecular markers. Ten other hierarchical clustering methods could be used for the construction of dendrograms from SSR markers and thirty-seven other hierarchical clustering methods could be used for the construction of dendrograms using binary data resulting from ISSR markers.

Emmanouil D. Pratsinakis, Lefkothea Karapetsi, Symela Ntoanidou, Angelos Markos, Panagiotis Madesis, Ilias Eleftherohorinos, George Menexes
One-Way Repeated Measures ANOVA for Functional Data

In this paper, the one-way repeated measures analysis of variance for functional data is considered. For this problem, the new test statistics are obtained by integrating and taking supremum of the constructed pointwise test statistic. To approximate the null distributions of the test statistics and construct the testing procedures, different bootstrap and permutation methods are used. The performance of the new tests and their comparisons with the known testing procedures in terms of size control and power is established in simulation studies. These studies indicate that the new tests may have different finite sample properties, but they are usually more powerful than the tests proposed in the literature.

Łukasz Smaga
Flexible Clustering

Flexibility of cluster analysis is sometimes understood as the robustness of the final partition of objects to changes in the list of diagnostic variables, that is, deleting some from the list or adding some. In this paper, we propose a procedure which makes it possible to calculate a distance matrix on the basis of different subsets of variables, while the selection of variables is unified by a common rule. The procedure starts with the classical standardization of each variable. Before the calculation of a distance between two objects, we eliminate the variable with the largest absolute value in the first object and the variable with the largest absolute value in the second object. If by chance the same variable is selected for elimination for both objects, the next variable with the largest absolute value (for both objects) should be eliminated. With this procedure, each element of the distance matrix is based on the same number of variables, but the variables can be different. As an example, a data set of 17 variables describing human smart society characteristics for 28 European Union countries is used.

Andrzej Sokołowski, Małgorzata Markowska
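
The elimination rule described in the Flexible Clustering abstract can be sketched as follows. This is one possible reading, not the authors' code: the Euclidean distance on the retained variables, the tie-breaking rule (drop the remaining variable with the largest absolute value across both objects), the function names, and the toy data are assumptions.

```python
import math


def standardize(data):
    """Classical standardisation: zero mean and unit sample standard deviation per variable."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(row[j] - means[j]) / sds[j] for j in range(len(row))] for row in data]


def flexible_distance(x, y):
    """Distance between two standardised objects after dropping two variables."""
    p = len(x)
    drop_x = max(range(p), key=lambda j: abs(x[j]))  # most extreme variable in x
    drop_y = max(range(p), key=lambda j: abs(y[j]))  # most extreme variable in y
    dropped = {drop_x, drop_y}
    if drop_x == drop_y:
        # Same variable selected twice: also drop the next most extreme one
        # (assumed here to be judged over both objects jointly).
        rest = [j for j in range(p) if j != drop_x]
        dropped.add(max(rest, key=lambda j: max(abs(x[j]), abs(y[j]))))
    kept = [j for j in range(p) if j not in dropped]
    # Euclidean distance on the retained variables (assumption).
    return math.sqrt(sum((x[j] - y[j]) ** 2 for j in kept))


# Hypothetical data: three objects described by four variables.
data = [[1.0, 10.0, 0.2, 5.0],
        [2.0, 14.0, 0.1, 7.0],
        [4.0, 9.0, 0.6, 6.0]]
z = standardize(data)
dist = [[flexible_distance(a, b) for b in z] for a in z]
```

Every entry of `dist` is computed from exactly two fewer variables than the data set contains, although which variables are dropped can differ from pair to pair.
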
Classification of Entrepreneurial Regimes: A Symbolic Polygonal Clustering Approach

Entrepreneurial regimes are a topic receiving ever more research attention. Existing studies on entrepreneurial regimes mainly use common methods from multivariate analysis and some type of institution-related analysis. In our analysis, the entrepreneurial regimes are analyzed by applying a novel polygonal symbolic data cluster analysis approach. Among the diverse data structures in Symbolic Data Analysis (SDA), interval-valued data is the most popular. Yet, this approach requires assuming the equidistribution hypothesis. We use a novel polygonal cluster analysis approach to address this limitation, with additional advantages: it stores more information, significantly reduces large data sets while preserving the classical variability through the polygon radius, and opens new possibilities in symbolic data analysis. We construct a dynamic cluster analysis algorithm for this type of data and prove the main theorems and lemmata that justify its usage. In the empirical part, we use a data set of the Global Entrepreneurship Monitor (GEM) for the year 2015 to construct typologies of countries based on responses to the main entrepreneurial questions. The article presents a novel approach to clustering in statistical theory (with a novel type of variable not previously accounted for) and an application to a pressing issue in entrepreneurship with novel results.

Andrej Srakar, Marilena Vecco
Multidimensional Factor and Cluster Analysis Versus Embedding-Based Learning for Personalized Supermarket Offer Recommendations

Multidimensional factor and cluster analysis and embedding-based machine learning were evaluated as the basis of a knowledge-based recommendation system for supermarket e-marketing. The goal was to produce personalized notifications on special offers, optimized according to each individual customer’s predicted response. To this end, we first applied Multiple Correspondence Analysis and Hierarchical Clustering to extract insights into ordering behaviors and to identify customer classes associated with predictable preference patterns. Second, a neural network model based on embeddings was developed to predict customers’ ordering actions at a personalized level and at large scale. Applying the factor and cluster analysis to the Instacart dataset identified typical and niche patterns with predictive value. The neural network model was successfully trained to predict individual customers’ future orders with satisfactory accuracy, to be used as a basis for composing personalized recommendations.

George Stalidis, Theodosios Siomos, Pantelis I. Kaplanoglou, Alkiviadis Katsalis, Iphigenia Karaveli, Marina Delianidi, Konstantinos Diamantaras
Motivation for Participating in the Sharing Economy: The Case of Hungary

Our research focuses on the sharing economy (SE), which has gained more and more ground in recent years and is receiving increased media coverage. The use of the sharing economy has spread rapidly since around 2005. Different authors define the system differently, and they analyse the participants’ motivation from different aspects. The main purpose of SE is to improve the utilisation of unused assets. Most people think that only economic factors have an impact on participation, but social and environmental motivations can also be important for users. Besides, there are numerous other factors that can influence the participants of SE. The aim of our study is to analyse why people take part in sharing economy activities in Hungary. We use the Structural Equation Modelling technique to determine which motivation factors are the most important. The study employs survey data from Hungarian sharing economy users.

Roland Szilágyi, Levente Lengyel
Benchmarking Minimax Linkage in Hierarchical Clustering

Minimax linkage was first introduced by Ao et al. (2004) as an alternative to standard linkage methods used in hierarchical clustering. Minimax linkage relies on distances to a prototype for each cluster; this prototype can be thought of as a representative object in the cluster, hence improving the interpretability of clustering results. Bien and Tibshirani (2011) analyzed properties of this method, popularizing it within the statistics community. Additionally, they performed some comparisons of minimax linkage to standard linkage methods. In an effort to expand upon their work and evaluate minimax linkage more comprehensively, we follow the guidelines for neutral benchmark studies outlined in Van Mechelen et al. (2018), focusing on thorough method evaluation via multiple performance metrics on several well-described data sets. We also make all code and data publicly available through an R package, for full reproducibility. Similarly to Bien and Tibshirani (2011), we find that minimax linkage often produces the smallest distances to prototypes, meaning that objects in a cluster are tightly clustered around their prototype. This holds across a range of values for the total number of clusters (k). However, it is not universally true, and special attention should be paid to the case when k is the true known value. For true k, minimax linkage does not always perform the best in terms of all the evaluation metrics studied, including distance to prototype.
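The prototype idea can be made concrete with a small sketch. It follows the usual definition of minimax linkage: the distance between two clusters is the smallest radius, taken over all points of the merged cluster, needed for a ball centered at one point to cover every member, and that center is the prototype. The function name `minimax_linkage` and the index-based interface are assumptions for illustration; this is not the code from the chapter's R package.

```python
import numpy as np

def minimax_linkage(D, G, H):
    """Return (linkage distance, prototype index) for merging clusters G and H.

    D is a full pairwise distance matrix; G and H are lists of row indices.
    """
    members = np.array(sorted(set(G) | set(H)))
    sub = D[np.ix_(members, members)]
    # For each candidate prototype, the distance to its farthest member.
    radius = sub.max(axis=1)
    # The prototype is the candidate with the smallest such radius.
    best = radius.argmin()
    return radius[best], members[best]

# Four points on a line at positions 0, 1, 2 and 10.
pos = np.array([0.0, 1.0, 2.0, 10.0])
D = np.abs(pos[:, None] - pos[None, :])
dist, proto = minimax_linkage(D, [0, 1], [2])
```

In this toy example the merged cluster {0, 1, 2} has the middle point as its prototype, since every member lies within distance 1 of it.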

Xiao Hui Tai, Kayla Frisoli
Clustering Binary Data by Application of Combinatorial Optimization Heuristics

We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters. Five new and original methods are introduced, using neighborhood-based and population-based combinatorial optimization metaheuristics: the first three are simulated annealing, threshold accepting and tabu search, and the other two are a genetic algorithm and ant colony optimization. The methods are implemented, with proper calibration of the heuristics' parameters to ensure good results. Using a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregations using $$L_1$$ dissimilarity, with hierarchical clustering and with a version of k-means, partitioning around medoids (PAM). Simulated annealing performs very well, especially compared to classical methods.

Javier Trejos-Zelaya, Luis Eduardo Amaya-Briceño, Alejandra Jiménez-Romero, Alex Murillo-Fernández, Eduardo Piza-Volio, Mario Villalobos-Arias
Classifying Users Through Keystroke Dynamics

The billions of users connected to the Internet, together with the anonymity that each of them can have behind a computer, are a source of many risks, such as financial fraud and seduction of minors. Most methods that have been proposed to remove this anonymity are intrusive, violate privacy, or are expensive. We propose the recognition of certain characteristics of an unknown user through keystroke dynamics, that is, the way a person types. The evaluation of the method consists of three stages: the acquisition of keystroke dynamics data from 110 volunteers during the daily use of their device, the extraction and selection of keystroke dynamics features based on their information gain, and the testing of user characteristics recognition by training five well-known machine learning models. Experimental results show that it is possible to identify the age group, the handedness, and the educational level of an unknown user with an accuracy of 87.6, 97.0, and 84.3, respectively.

Ioannis Tsimperidis, Georgios Peikos, Avi Arampatzis
Technological Innovation and the Critical Raw Material Stock

We live in a dynamically changing world. There have been so many innovations in the last few years that raw materials have become indispensable. The European Commission collected information in separate studies for Critical Raw Materials (CRMs) and for non-critical raw materials. This paper is based on those data. In 2011 there were 14 materials on the critical raw material list, but by 2017 this list had grown to contain 27 materials. Critical raw materials play a key role in technological innovation; they are the necessary raw materials for many innovations. In this paper, our aim is to identify groups using hierarchical cluster analysis and to identify which clusters are important for innovation. We selected three variables for cluster analysis: Economic Importance (EI), Supply Risk (SR), and End of Life recycling input rate (EoL), and we identified five homogeneous groups. There is one group that seems particularly important because it includes only critical raw materials.

Beatrix Varga, Kitti Fodor
Redundancy Analysis for Binary Data Based on Logistic Responses

Redundancy Analysis (RDA) is one of the many possible methods to extract and summarize the variation in a set of response variables that can be explained by a set of explanatory variables. The main idea is to use multivariate linear regression to explain the responses as a linear function of the explanatory variables and then use Principal Component Analysis (PCA) or a biplot to visualize the result. When response variables are categorical (binary, nominal, or ordinal), classical linear techniques are not adequate. Some alternatives, such as Distance-Based RDA, have been proposed in the literature. In this paper, we propose versions of RDA based on generalized linear models with logistic responses. The natural visualization methods for our techniques are the recently proposed Logistic Biplots. The procedures are illustrated with an application to real data.
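For orientation, the classical linear RDA that the abstract takes as its starting point (regression of the responses on the explanatory variables, followed by PCA of the fitted values) might be sketched as below. This is a minimal sketch of the classical method only, not the logistic version the chapter proposes; the function name `classical_rda` and its return values are assumptions.

```python
import numpy as np

def classical_rda(X, Y, n_components=2):
    """Classical RDA: linear regression of Y on X, then PCA of the fitted values."""
    Xc = X - X.mean(axis=0)          # centered explanatory variables
    Yc = Y - Y.mean(axis=0)          # centered responses
    # Multivariate least-squares regression of the responses on the predictors.
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    Y_hat = Xc @ B
    # PCA of the fitted values via the singular value decomposition.
    U, s, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # object scores
    loadings = Vt[:n_components].T                     # response loadings
    # Share of the total response variation explained by the predictors.
    explained = (s ** 2).sum() / (Yc ** 2).sum()
    return scores, loadings, explained
```

The scores and loadings are the ingredients of a biplot. With binary responses the least-squares step is the part that stops being appropriate, which is the motivation stated in the abstract for moving to logistic responses.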

Jose L. Vicente-Villardon, Laura Vicente-Gonzalez
Predictive Power of School Motivation Clusters in Secondary Education

In many applications of cluster analysis in educational research, the solutions found have very limited predictive power for relevant outcomes. In this paper, we explore whether the clusterings found have more predictive power (in terms of explained variance) if relevant outcomes are included in the estimation procedure, using a real-world data set on school motivation. We compare various normal mixture models that involve different distal outcomes: no outcome variable, a single outcome, or all outcomes. All models were estimated using the simultaneous estimation (one-step) procedure for distal outcomes in Latent GOLD. Partial eta squared ($$\eta^2_p$$) was used to assess predictive power. Including relevant outcomes will in most cases increase the predictive power of the models. Furthermore, the increase in power is more substantial, in the absolute sense, when the correlation between the outcome variable and input variables is higher.
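As a reminder of the measure used here, partial eta squared is conventionally defined from the sums of squares of an analysis of variance. The notation below ($$SS_{\text{effect}}$$ for the variation attributed to cluster membership, $$SS_{\text{error}}$$ for the residual variation) is the standard textbook form and is an assumption about the chapter's notation:

$$\eta^2_p = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}$$

When cluster membership is the only factor, the denominator equals the total sum of squares, so $$\eta^2_p$$ coincides with the proportion of outcome variance explained by the clustering.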

Matthijs J. Warrens, W. Miro Ebert
Metadata
Title
Data Analysis and Rationality in a Complex World
Edited by
Theodore Chadjipadelis
Berthold Lausen
Angelos Markos
Tae Rim Lee
Angela Montanari
Rebecca Nugent
Copyright year
2021
Electronic ISBN
978-3-030-60104-1
Print ISBN
978-3-030-60103-4
DOI
https://doi.org/10.1007/978-3-030-60104-1
