1 Declarations
- Availability of data and material: We used the MPG, Automobile and Adult data sets from the UCI Public Data Repository [7] as well as the Airport data set from the public project Open Flights (http://openflights.org/data.html).
- Code availability: Our algorithm is implemented in Java. The source code and the data sets are publicly available at https://tinyurl.com/ucp8289.
2 Introduction
- Integration: ClicoT integrates two types of information, using data compression as the optimization goal. It flexibly learns the relative importance of the two sources of information for clustering, without requiring the user to specify input parameters, which are usually difficult to estimate.
- Interpretation: In contrast to most clustering algorithms, ClicoT not only reports which objects are assigned to which clusters, but also answers the central question of why objects are clustered together. Each cluster found by ClicoT is characterized by a signature of cluster-specific relevant attributes, which supports its interpretation.
- Robustness: The compression-based objective function ensures that only the truly relevant attributes are marked as cluster-specific. We thereby avoid overfitting, enhance interpretability and guarantee the validity of the result.
- Usability: ClicoT is convenient to use in practice, since the algorithm scales well to large data sets. Additionally, the compression-based approach avoids the difficult estimation of input parameters, e.g., the number or the size of clusters.
3 Clustering mixed data types
3.1 Concept hierarchy
3.2 Cluster-specific elements
3.3 Integrative objective function
4 Algorithm
4.1 How to specify cluster-specific elements?
4.2 Probability adjustment
4.3 ClicoT algorithm
5 Related work
6 Evaluation
6.1 Mixed-type clustering of synthetic data
6.2 Experiments on real-world data
Category | \(C_2\) | \(C_3\) |
---|---|---|
Family | \(-0.24\) | 0.359 |
Wife | 0.025 | \(-0.047\) |
Own child | 0.111 | \(-0.154\) |
Husband | \(-0.398\) | 0.59 |
Other relative | 0.02 | \(-0.028\) |
No family | 0.24 | \(-0.359\) |
Unmarried | 0.074 | \(-0.105\) |
Not in family | 0.165 | \(-0.253\) |
- Case 1: \(x_i\) and \(x_j\) belong to the same cluster of C and the same category of P
- Case 2: \(x_i\) and \(x_j\) belong to the same cluster of C but different categories of P
- Case 3: \(x_i\) and \(x_j\) belong to different clusters of C but the same category of P
- Case 4: \(x_i\) and \(x_j\) belong to different clusters of C and different categories of P
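
The four cases partition all pairs of objects \((x_i, x_j)\) with \(i < j\), so they can be tallied in a single pass over the pairs. The following is a minimal sketch, not taken from the published source code. The class name `PairCases`, the method `countPairCases`, and the encoding of C and P as integer label arrays are assumptions made for illustration.

```java
import java.util.Arrays;

public class PairCases {

    /**
     * Counts the object pairs (i < j) that fall into each of the four cases.
     * Assumed encoding: clusters[i] is the label of x_i in C,
     * categories[i] is the label of x_i in P.
     * Returns {case 1, case 2, case 3, case 4}.
     */
    public static long[] countPairCases(int[] clusters, int[] categories) {
        if (clusters.length != categories.length) {
            throw new IllegalArgumentException("label arrays must have equal length");
        }
        long[] counts = new long[4];
        for (int i = 0; i < clusters.length; i++) {
            for (int j = i + 1; j < clusters.length; j++) {
                boolean sameCluster = clusters[i] == clusters[j];
                boolean sameCategory = categories[i] == categories[j];
                // Map the two boolean outcomes to the case index 0..3.
                int idx = sameCluster ? (sameCategory ? 0 : 1)
                                      : (sameCategory ? 2 : 3);
                counts[idx]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical toy example with four objects.
        int[] clusters = {0, 0, 1, 1};
        int[] categories = {0, 0, 0, 1};
        System.out.println(Arrays.toString(countPairCases(clusters, categories)));
        // prints [1, 1, 2, 2]
    }
}
```

The four counts always sum to \(n(n-1)/2\) for \(n\) objects. The double loop is quadratic in \(n\); for large data sets, the same counts could instead be derived from a contingency table of C against P.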