We present a set of algorithms implementing multidimensional scaling (MDS) for large data sets. MDS is a family of dimensionality reduction techniques using a $$n \times n$$ n × n distance matrix as input, where n is the number of individuals, and …
The paper discusses eleven internal evaluation criteria that can be used in the area of hierarchical clustering of categorical data. The criteria are divided into two distinct groups based on how they treat the cluster quality: variability- and …
verfasst von:
Zdenek Sulc, Jaroslav Hornicek, Hana Rezankova, Jana Cibulkova
Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their …
verfasst von:
Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij
This work presents a novel undersampling scheme to tackle the imbalance problem in multi-label datasets. We use the principles of the natural nearest neighborhood and follow a paradigm of label-specific undersampling. Natural-nearest neighborhood …
This paper proposes a fuzzy C-medoids-based clustering method with entropy regularization to solve the issue of grouping complex data as interval-valued time series. The dual nature of the data, that are both time-varying and interval-valued …
verfasst von:
Vincenzina Vitale, Pierpaolo D’Urso, Livia De Giovanni, Raffaele Mattera
Clustering ensemble combines several fundamental clusterings with a consensus function to produce the final clustering without gaining access to data features. The quality and diversity of a vast library of base clusterings influence the …
Finite mixtures of regressions (FMRs) are powerful clustering devices used in many regression-type analyses. Unfortunately, real data often present atypical observations that make the commonly adopted normality assumption of the mixture components …
With the significant increase of interactions between individuals through numeric means, clustering of nodes in graphs has become a fundamental approach for analyzing large and complex networks. In this work, we propose the deep latent position …
verfasst von:
Dingge Liang, Marco Corneli, Charles Bouveyron, Pierre Latouche
The Bayesian information criterion (BIC), defined as the observed data log likelihood minus a penalty term based on the sample size N, is a popular model selection criterion for factor analysis with complete data. This definition has also been …
verfasst von:
Jianhua Zhao, Changchun Shang, Shulan Li, Ling Xin, Philip L. H. Yu
To measure the degree of agreement between R observers who independently classify n subjects within K categories, various kappa-type coefficients are often used. When R = 2, it is common to use the Cohen' kappa, Scott's pi, Gwet’s AC1/2, and …
In the era of climate change, the distribution of climate variables evolves with changes not limited to the mean value. Consequently, clustering algorithms based on central tendency could produce misleading results when used to summarize spatial …
Functional logistic regression is a popular model to capture a linear relationship between binary response and functional predictor variables. However, many methods used for parameter estimation in functional logistic regression are sensitive to …
verfasst von:
Berkay Akturk, Ufuk Beyaztas, Han Lin Shang, Abhijit Mandal
A spatial point pattern is a collection of points observed in a bounded region of the Euclidean plane or space. With the dynamic development of modern imaging methods, large datasets of point patterns are available representing for example …
Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram. It has recently gained much popularity from its myriad successful …
A key point to assess statistical forecasts is the evaluation of their predictive accuracy. Recently, a new measure, called Rank Graduation Accuracy (RGA), based on the concordance between the ranks of the predicted values and the ranks of the …
This paper addresses the two-class classification problem for data with rare and weak signals, under the modern high-dimension setup $$p>>n$$ p > > n . Considering the two-component mixture of Gaussian features with different random mean vector of …
In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well-known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated …
Modal clustering is an unsupervised learning technique where cluster centers are identified as the local maxima of nonparametric probability density estimates. A natural algorithmic engine for the computation of these maxima is the mean shift …
The Ultimate Host Economies (UHEs) of a given country are defined as the ultimate destinations of Foreign Direct Investment (FDI) originating in that country. Bilateral FDI statistics struggle to identify them due to the non-negligible presence of …