Rock: A robust clustering algorithm for categorical attributes☆
Abstract
Clustering, in data mining, is useful for discovering distribution patterns in the underlying data. Clustering algorithms usually employ a distance-based (e.g., Euclidean) similarity measure to partition the database so that data points in the same partition are more similar to each other than to points in different partitions. In this paper, we study clustering algorithms for data with boolean and categorical attributes. We show that traditional clustering algorithms that use distances between points are not appropriate for boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points. We develop a robust hierarchical clustering algorithm, ROCK, that employs links rather than distances when merging clusters. Our methods naturally extend to non-metric similarity measures, which are relevant in situations where a domain expert/similarity table is the only source of knowledge. In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques. For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but also exhibits good scalability properties.
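The link idea described in the abstract can be sketched in a few lines: two points are neighbors if their similarity exceeds a threshold θ, and the link count of a pair is the number of neighbors they share. The sketch below is illustrative only, assuming Jaccard similarity over transactions represented as item sets; the `links` helper is hypothetical, not the authors' implementation, and ROCK itself further aggregates link counts into a goodness measure when merging clusters.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two transactions (sets of items)."""
    return len(a & b) / len(a | b)

def links(points, theta=0.5):
    """link(p, q) = number of common neighbors of p and q,
    where p and q are neighbors iff jaccard(p, q) >= theta.
    Illustrative sketch only, not the paper's ROCK algorithm."""
    n = len(points)
    # neighbor sets under the similarity threshold theta
    nbrs = [{j for j in range(n)
             if j != i and jaccard(points[i], points[j]) >= theta}
            for i in range(n)]
    # link count for every unordered pair
    return {(i, j): len(nbrs[i] & nbrs[j])
            for i, j in combinations(range(n), 2)}
```

On four toy transactions, `links([{1,2,3}, {1,2,4}, {1,2,5}, {7,8,9}])` gives the first three points one link each with one another and zero links to the outlier, which is the behavior that makes links more robust than raw pairwise similarity for categorical data.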
Cited by (723)
On the relative value of clustering techniques for Unsupervised Effort-Aware Defect Prediction
2024, Expert Systems with Applications
Unsupervised Effort-Aware Defect Prediction (EADP) uses unlabeled data to construct a model and ranks software modules according to the software feature values. Xu et al. (JSS 2021) conducted an exploration of clustering techniques for unsupervised defect prediction and found that several clustering methods exhibit better performance on the F1@20% effort-aware metric. However, their conclusion may not be convincing, as they did not take into account the impact of the Initial False Alarms (IFA) metric on unsupervised EADP. Furthermore, their study did not compare with the state-of-the-art supervised EADP models. To investigate clustering techniques for unsupervised EADP more comprehensively, we explore the performance of 22 clustering techniques for unsupervised EADP using three classification metrics and six effort-aware metrics. The experimental results demonstrate that (1) the best clustering technique for unsupervised EADP, K-medoids, can significantly reduce the IFA of the ManualUp method to an acceptable range; in contrast, the clustering techniques recommended by Xu et al. exhibit a high IFA value that cannot be deemed acceptable by testing teams; (2) K-medoids performs better than some supervised EADP methods, especially on metrics such as IFA and PMI@20% (Proportion of Modules Inspected when inspecting the top 20% lines of code); (3) better classification performance of clustering techniques could lead to better effort-aware performance. In summary, we recommend using the K-medoids clustering technique for unsupervised EADP and suggest that future research devote more effort to exploring better unsupervised clustering techniques. In support of reproducibility and future research, we provide the source code used in our study (https://github.com/Andre-Yang816/Clustering4UEADP).
An adaptive density clustering approach with multi-granularity fusion
2024, Information Fusion
Real-world datasets exhibit diversity, incorporating instances with complex shapes and significant differences in density hierarchy, potentially disrupted by noise. However, most clustering algorithms rely on single-granularity fusion, requiring global parameters to be pre-set for the entire dataset. These global parameters may not adequately adapt to clusters with varying hierarchies or shapes, consequently reducing clustering effectiveness. Therefore, we propose an adaptive density clustering approach with multi-granularity fusion. This approach characterizes the dataset at multiple granularities, forming natural granular-balls. After processing these natural granular-balls, overlapping ones are fused to yield the final clustering result. The approach adeptly identifies datasets with significant differences in shape or density hierarchy and exhibits a certain level of robustness. All code has been released at https://github.com/xjnine/NGBC.
“Whatever It Takes!” How tonality of TV-news affected government bond yield spreads during the European debt crisis
2024, European Journal of Political Economy
Were government bond risk premia affected by the media in addition to the effects of major events? Revisiting the European debt crisis, we analyze the role of television news in the rise and re-convergence of GIIPS bond spreads vis-à-vis Germany from 2007 to 2016. We use a dataset of more than one million human-coded news items from leading newscasts worldwide to identify over 25,000 news items on the Eurozone and country-specific economic topics. Our findings emphasize the relevance of the tonality of news: an increasing share of positive (negative) news correlates with a decrease (increase) in spreads. Content-based endogenous clustering of news highlights the importance of news about institutions providing stability and "international financial support" to distressed countries in reducing bond spreads. Moreover, weekend news enables us to establish a causal link between country-specific news coverage and changes in spreads on the subsequent trading day.
Boosting cluster tree with reciprocal nearest neighbors scoring
2024, Engineering Applications of Artificial Intelligence
Clustering plays a pivotal role in knowledge processing, knowledge bases, and expert systems, enabling AI systems to acquire knowledge effectively. Hierarchical clustering, in particular, offers an intelligent approach to representing knowledge hierarchically by transforming raw data into one or multiple tree-shaped components. However, a notable difficulty arises when attempting to pinpoint appropriate representative points within the lower levels of the cluster tree. These points are of paramount importance, as they serve as the roots for subsequent aggregation within the upper levels of the cluster tree. Traditional hierarchical clustering algorithms have relied on rudimentary techniques to select these representative points, which may not provide an adequate representation; consequently, the resulting cluster tree often falls short in terms of empirical performance. To address this shortcoming, we propose an innovative hierarchical clustering algorithm in this paper. The proposed algorithm efficiently identifies the representative point within each sub-minimum-spanning-tree during the construction of the cluster tree, achieved by topology-based scoring of the reciprocal nearest data points. Rigorous testing on UCI datasets has demonstrated the superior clustering accuracy (measured by Rand Index and Normalized Mutual Information) of our proposed algorithm compared to other benchmark algorithms. Further analysis of the algorithm's time- and space-complexity indicates its scalability and efficiency in handling large-scale data with minimal time and storage costs. Importantly, our algorithm's ability to process up to two million data points on a standard personal computer underscores its cost-effectiveness.
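The key primitive named in the abstract above, reciprocal nearest neighbors, is easy to illustrate: a pair of points are reciprocal nearest neighbors when each is the other's closest point. The sketch below is a minimal illustration of that definition only, assuming Euclidean distance; the helper names are hypothetical and this is not the cited paper's scoring algorithm.

```python
import math

def nearest_neighbor(points, i):
    """Index of the point closest to points[i] (Euclidean, excluding i)."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def reciprocal_pairs(points):
    """Unordered pairs (i, j), i < j, that are each other's nearest neighbor."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [(i, nn[i]) for i in range(len(points))
            if nn[nn[i]] == i and i < nn[i]]
```

For example, `reciprocal_pairs([(0, 0), (0, 1), (5, 5), (5, 6)])` returns `[(0, 1), (2, 3)]`: each tight pair of points mutually selects the other, which is what makes such pairs natural seeds for representative points in a cluster tree.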
Interdependence analysis on heterogeneous data via behavior interior dimensions
2023, Knowledge-Based Systems
Interdependent dimensions, including categorical and continuous variables, are commonly seen as heterogeneous behavioral data in the real world. Mixed-type objects are more or less associated in terms of certain coupling relationships. The usual representation of such behavioral data is an information table with explicit behavior exterior dimensions (i.e., the original attributes describing data heterogeneity), assuming the independence of dimensions and the independence of objects. However, both variables and objects are very often interdependent on one another, either explicitly or implicitly, in functional and semantic manners. Limited research has been done on analyzing such interactions among dimensions and relationships among objects, leading the learning results to be more local than global. This paper proposes interdependence analysis to capture the multifarious functional relationships among attributes and among objects in heterogeneous data by addressing the coupling context and coupling weights in unsupervised learning. Such global couplings consider the interactions within discrete dimensions, within numerical attributes and across them, as well as the relationships within an individual object and between multiple objects, to form attribute-based and object-based coupled data representation schemes based on feature conversion and neighborhood calculation. In addition, we interpret both representation models via implicit behavior interior dimensions (i.e., newly defined attributes that model data interdependence) to explain the intrinsic rationales for the superiority of our proposed methods. This work explicitly models the coupling of multiple attributes and the coupling of multiple objects for heterogeneous data sets, demonstrated by various data mining and machine learning applications, such as cluster structure analysis, data clustering evaluation, and data density comparison. Moreover, a sensitivity study is carried out to tune the neighborhood and weight parameters, and a scalability analysis is explored to test the robustness of both models. Extensive experiments on a series of synthetic data sets and multiple UCI data sets show that our proposed framework can effectively capture the global couplings of both heterogeneous variables and mixed-type objects, and is superior to the traditional way as well as state-of-the-art approaches, which is also verified by statistical analysis.
Effective and efficient community search with size constraint on bipartite graphs
2023, Information Sciences
Community search on bipartite graphs has been extensively studied for suspicious-group detection and team formation. However, existing studies focused on the cohesiveness of the community but ignored the size constraint on bipartite graphs, potentially leading to large community sizes and high costs. In this study, a size-constrained ()-community (SCC) containing a query vertex on a bipartite graph was investigated, where the upper layer size of the community cannot exceed threshold s and the lower layer size cannot exceed threshold t. To support SCC search in different situations, two search methods, peeling and expansion, are proposed: peeling from the ()-core containing the query vertex and expanding from the query vertex, respectively. An efficient lower bound based on the degree gap is proposed, terminating unpromising search branches early to increase the efficiency of the community search. The experimental results indicate that the proposed methods can find communities within the size thresholds, with the efficiency of the search increased by the lower bound.
☆ Recommended by Felipe Carino.