Introduction
Clustering high dimensional data
Deep Autoencoder: challenges and issues
The contributions of our work and structure
Related work
Research methodology
Input data and preparation
- Individual features describe an individual's mobile phone usage and interactions with contacts (e.g. average call duration per day, standard deviation of received SMS per day during work time, the user's response rate). Individual behavioral features are extracted from Call Detail Records (CDRs) generated by the Syrian Telecom cellular network; CDRs log users' activities for billing and network-management purposes.
- Spatial features describe the mobility patterns of an individual while using the mobile phone (e.g. entropy of visited antennas, average mobility, daytime antenna entropy). Spatial and mobility features are extracted from the database of cell towers and their locations, a sample of which is shown in Table 1.

Table 1 Sample of the cells and sites database

| Cell identifier | Site identifier | Longitude | Latitude |
|---|---|---|---|
| C147 | S73 | **.******2 | **.******7 |
| C23 | S119 | **.******0 | **.******6 |
| C64 | S14 | **.******1 | **.******0 |
| ... | ... | ... | ... |
- Social network features describe the social network of an individual and compare their behavior with that of their peers (e.g. clustering coefficient, assortativity).
GSM | Avg dur percall | Avg sms perday | Antenna entropy | Contact num | ... |
---|---|---|---|---|---|
+963********9 | 48.125721 | 1.846 | 2.7515 | 2 | ... |
+963********5 | 148.129570 | 3.620 | 4.316 | 4 | ... |
+963********8 | 72.03 | 1.37 | 2.0487 | 6 | ... |
... | ... | ... | ... | ... | ... |
GSM | Economy | Education | Health | Horoscopes | Technology | Sport | ... |
---|---|---|---|---|---|---|---|
+963********9 | 0 | 1 | 0 | 0 | 1 | 0 | ... |
+963********5 | 0 | 0 | 1 | 1 | 0 | 0 | ... |
+963********8 | 1 | 0 | 0 | 0 | 0 | 0 | ... |
+963********3 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
... | ... | ... | ... | ... | ... | ... | ... |
The experimental process
- Data cleansing: the data is cleansed through processes such as filling in or removing missing values, smoothing noisy data, and resolving inconsistencies in the data.
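A minimal cleansing sketch with pandas is shown below; the file name and column names (e.g. `gsm`) are illustrative placeholders, not the actual ones used in our pipeline.

```python
import pandas as pd

# Illustrative cleansing pass (file and column names are hypothetical).
df = pd.read_csv("cdr_features.csv")

# Drop records that cannot be linked to a subscriber.
df = df.dropna(subset=["gsm"])

# Fill remaining missing numeric values with the column median (one possible choice).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Remove exact duplicate records so they do not bias cluster sizes.
df = df.drop_duplicates()
```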
- Data transformation: standardization is a key pre-processing step in data mining that rescales features or attribute values from different dynamic ranges into a specific range. Standardization leads to better-quality, more efficient, and more accurate clustering results, and the standardization procedure should be chosen according to the nature of the dataset under analysis [48]. In our case, the best results were obtained with the z-score standardization method, which proved more effective than the other standardization methods. Min–Max scaling is computed as:
$$\begin{aligned} MinMax{\text{-}}score= & {} \frac{x-\min (x)}{\max (x)-\min (x)} \end{aligned}$$
(1)
Z-score normalization is computed as:
$$\begin{aligned} Z{\text{-}}score= & {} \frac{x-\mu }{\sigma } \end{aligned}$$
(2)
where \(\mu\) is the mean of the population and \(\sigma\) is the standard deviation of the population. Min–Max compresses the values into a fixed range, but when data outliers are important and we do not want to lose their impact, we scale with z-score normalization instead.
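For reference, both scalings can be obtained with scikit-learn as in the following sketch (the feature matrix here is a random stand-in for the real data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Stand-in for the behavioral feature matrix (n_samples x 220).
X = np.random.rand(1000, 220)

X_minmax = MinMaxScaler().fit_transform(X)    # Eq. (1): each feature rescaled to [0, 1]
X_zscore = StandardScaler().fit_transform(X)  # Eq. (2): zero mean, unit variance per feature
```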
- Dimensionality reduction: dimensionality is the number of characteristics, variables, or features that describe the dataset. These dimensions are represented as columns, and the goal is to reduce their number. In most cases those columns are correlated, so part of the information is redundant, which increases the dataset's noise. This redundancy has a negative impact on the results of training our model, so dimensionality reduction techniques such as feature selection and feature extraction are a very useful way to reduce the model's complexity and avoid overfitting. In our case study, we applied three different scenarios leading to different clustering models, shown in Fig. 3 and described as follows:

1. The baseline: clustering is performed on the original dataset without applying any data reduction method. Comparing against this baseline allows us to understand whether the reduced datasets yield better clustering accuracy than the result obtained without data reduction.

2. (PCA-based) customer segmentation: first, the PCA transformation is applied to the original data. The data is transformed into new features, each a combination of the original features in a new space, compressed so that the most relevant information is retained. Then the k-means clustering algorithm is applied to the transformed dataset. Applying PCA before clustering is a powerful method for clustering high-dimensional data in the presence of linear correlations among the features, so we want to know which feature reduction method can enhance the clustering accuracy.

Step 1: PCA implementation. Using the unlabeled training dataset, we have a data matrix X of 100,000 × 220. This matrix representation not only makes the variables difficult to visualize and interpret but is also computationally taxing when clustering. We therefore apply PCA to reduce the dimensionality of the data by finding the projection direction(s) that minimize the squared error in reconstructing the original data or, equivalently, maximize the variance of the projected data. Its performance depends on the distribution of the dataset and the correlation between features. PCA is mathematically defined as an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on.

The first step of PCA is to standardize the dataset matrix. This is important because features or columns with larger scales than the others would ultimately dominate the final principal component matrix, attributing most of the explained variance to those features. This step is performed by a standard scaler, which subtracts each column's mean and divides by its standard deviation. Standardization has other advantages as well, such as faster learning for neural networks, which becomes important later when the autoencoder is implemented. The second step is to calculate the covariance matrix of the standardized matrix: the covariances between the variables are obtained by multiplying the transpose of the standardized matrix with the matrix itself (divided by the number of samples minus one), yielding a symmetric 220 × 220 matrix.
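A sketch of these first two steps, using scikit-learn's StandardScaler and NumPy on a stand-in matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the 100,000 x 220 data matrix X (smaller here for illustration).
X = np.random.rand(1000, 220)

# Step 1: standardize each column (subtract the mean, divide by the standard deviation).
X_std = StandardScaler().fit_transform(X)

# Step 2: covariance matrix of the standardized features, a symmetric 220 x 220 matrix.
cov_matrix = np.cov(X_std, rowvar=False)
```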
Each diagonal element of this covariance matrix is the variance of one feature and each off-diagonal element is the covariance between a particular pair of features, so the covariance matrix has the following structure:
$$\begin{aligned} Cov{\text{-}}Matrix = \begin{bmatrix} var(x1) &{} cov(x1,x2) &{} \dots &{} cov(x1,x220) \\ cov(x1,x2) &{} var(x2) &{} \dots &{} cov(x2,x220) \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ cov(x1,x220) &{} cov(x2,x220) &{} \dots &{} var(x220) \end{bmatrix} \end{aligned}$$
The third step is to obtain the covariance matrix's eigenvectors using eigenvalue decomposition, which gives a matrix whose columns are the eigenvectors of Cov-Matrix. These columns represent the principal components (the new dimensions); each component is orthogonal to the others, and they are arranged in decreasing order of explained variance. The last step is to multiply the transformation matrix (the eigenvectors matrix) with the scaled data matrix to compute the principal component scores, and then to select the number of components that capture the greatest amount of variance in the data.

In our analysis, the method used to determine the number of principal components is a scree plot, which plots the principal components ordered from largest to smallest explained variance. The number of components is determined at the variance drop-off point, where the gain in explained variance first decreases significantly and beyond which the remaining eigenvalues are all relatively small and of comparable size [49]. As shown in Fig. 4, the scree plot displays the explained variance as a percentage of the total variance for the first 20 PCs, which together retain 90% of the variance, with a variance drop-off after the third component, which accounts for roughly 28% of the total explained variance. We found that the first three principal components explain 72% of the variation, which is an acceptably large percentage.

Denote the transformation matrix by TM-L when reducing the dimensionality of our dataset, where L is the number of retained principal components and \(L < 220\). We can then compress the data using TM-L, which still has the same number of rows but only L columns, producing a reduced dataset on which the clustering algorithm is applied in an unsupervised manner, since we have no labels. It is nonetheless important to reveal the contribution of each feature to the components before proceeding to the last step of the process.

After applying PCA to the original data, we conducted our experiments on the numbers of principal components (10, 20, 30, 50) that maintain the largest percentage of variance in the data and give a total variance ratio greater than 80%. We evaluated the projection loss by computing the MSE loss function, which expresses the difference between the original data and the data reconstructed from the projection; the results are shown in Table 4.

Table 4 The projection loss of the principal components

| PCs | Projection loss (MSE) | Variance percentage |
|---|---|---|
| 10 | 0.1943 | 0.80 |
| 20 | 0.0997 | 0.90 |
| 30 | 0.0586 | 0.94 |
| 50 | 0.0260 | 0.97 |
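The remaining steps can be sketched as follows (again on a stand-in matrix; the actual analysis was run on the full 100,000 × 220 dataset):

```python
import numpy as np

# Stand-in for the standardized data and its covariance matrix from the previous step.
rng = np.random.default_rng(0)
X_std = rng.standard_normal((1000, 220))
cov_matrix = np.cov(X_std, rowvar=False)

# Step 3: eigendecomposition (eigh is appropriate for a symmetric matrix);
# sort components by decreasing explained variance.
eigvals, eigvecs = np.linalg.eigh(cov_matrix)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the top L components (the TM-L matrix in the text) and project the data.
L = 20
tm_l = eigvecs[:, :L]
scores = X_std @ tm_l

# Reconstruct from the projection and measure the projection loss (MSE), as in Table 4,
# together with the retained variance ratio.
X_rec = scores @ tm_l.T
projection_loss = np.mean((X_std - X_rec) ** 2)
variance_ratio = eigvals[:L].sum() / eigvals.sum()
```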
Step 2: Reveal the most important variables/features after the PCA analysis. The important features are those that influence the principal components most and therefore have a high absolute score on them. To see how the components are correlated with the original features, we retrieve the indexes of the most important features on each component: a large contribution means that the variable contributes more to that component and is crucial in explaining the variety in the dataset. Variables that do not correlate with any component, or that correlate only with the last dimensions, have a low contribution and might be eliminated to simplify the overall analysis.

The importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors. First, we obtain the amount of variance that each PC explains (the PCA explained variance ratio), then we find the most important features by looking at the absolute values of the eigenvector components corresponding to the k largest eigenvalues; in sklearn the components are already sorted by explained variance. The larger these absolute values are, the more a specific feature contributes to that principal component. Table 5 lists the variables' contributions to the principal components, expressed in percentage, and Table 6 shows, for each principal component, which variables contribute most to it.

Table 5 Variables' contributions to the principal components

| | Avg dur percall | Avg dur percall out | Avg dur percall in | Std dur percall | Std dur percall out | ... |
|---|---|---|---|---|---|---|
| PC1 | 0.072604 | 0.065833 | 0.059138 | 0.07583 | 0.07131 | ... |
| PC2 | −0.1008 | −0.08562 | −0.0872 | −0.09548 | −0.07959 | ... |
| PC3 | −0.08335 | −0.05488 | −0.08701 | −0.08033 | −0.0485 | ... |

Table 6 The most important features for each component

| | Important Var1 | Important Var2 | Important Var3 |
|---|---|---|---|
| PC1 | Avg dur perday | Std dur perday | Avg dur perday workday |
| PC2 | Avg dur percall workday out | Std dur perday | Avg dur percall workday in |
| PC3 | Avg call perday daylight | Avg call perday workday | Avg call perday worktime |

Step 3: Find the clusters. The k-means clustering algorithm is applied in this step to cluster the reduced dataset, represented by the top 20 PCs as the new feature space. First, we applied the elbow method to determine the optimal number of clusters for k-means; the resulting elbow plot is shown in Fig. 5. The elbow method is one of the most popular ways to determine the optimal value of k: run k-means on the dataset for a range of values of k (here, k from 1 to 15), compute the sum of squared errors (SSE) for each value of k, and plot the SSE against k. If the line chart looks like an arm, the "elbow" of the arm is the best value of k. We want a small SSE, but the SSE tends to decrease towards 0 as k increases (the SSE is 0 when k equals the number of data points, because each data point then forms its own cluster and there is no error between it and its cluster center). So our goal is to choose a small value of k that still has a low SSE; the elbow usually marks the point where increasing k starts to give diminishing returns. Figure 5 shows the SSE after running k-means for k from 1 to 15: there is a fairly clear elbow at k = 4, indicating that 4 is the best number of clusters. A sketch of steps 2 and 3 is given below.
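A compact sketch of steps 2 and 3 with scikit-learn; the data is a random stand-in and the printed feature indices are illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for the standardized feature matrix.
rng = np.random.default_rng(0)
X_std = rng.standard_normal((1000, 220))

# Step 2: fit PCA and rank the original features by the absolute value of their
# loadings on each leading component (the basis for Tables 5 and 6).
pca = PCA(n_components=20).fit(X_std)
explained = pca.explained_variance_ratio_
top3_per_pc = np.argsort(np.abs(pca.components_), axis=1)[:, ::-1][:, :3]

# Step 3: elbow method on the reduced data - SSE (inertia) of k-means for k = 1..15.
X_pca = pca.transform(X_std)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca).inertia_
       for k in range(1, 16)]
```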
3. (AutoEncoder-based) customer segmentation: in addition to examining the model performance under the linear transformation given by PCA, dimensionality reduction based on an Artificial Neural Network (ANN) is also considered. The result can be used to compare the performance of the autoencoder and PCA on the data reduction task.

Autoencoders are a type of artificial neural network used to learn a representation, or encoding, of a set of unlabeled data as a first step towards reducing dimensionality: the input data matrix is compressed (encoded) into a reduced set of dimensions and then reconstructed (decoded) back to its original form. The compressed data can then be used in different applications, such as clustering in our case. An autoencoder can learn nonlinear transformations by using a nonlinear activation function and multiple hidden layers: the input layer (the data matrix X) is passed through hidden layers whose dimensions are lower than that of the input layer, performing compression, and the network aims to reconstruct the original data as closely as possible with minimum reconstruction error.

The autoencoder training algorithm can be summarized in the following steps, for each input x:
1. Feed-forward pass to calculate all hidden layers' activations and store them in a memory cache. Simultaneously, calculate the final output x1 in the last layer.
2. Measure the deviation of x1 from the input x using the loss function.
3. Backpropagate the error through the network and update the weights.
4. Repeat until the resulting loss is acceptable or another stopping criterion is satisfied.

Step 1: Autoencoder implementation. For our experiments, we trained a fully connected autoencoder as a symmetric model, meaning the input data is compressed and decompressed in exactly opposite manners. Before training an autoencoder, it is essential to set the hyperparameters, which include the model design components and the training variables:
1. Code size: the number of nodes in the middle layer; more compression means a smaller size.
2. Number of layers: the number of hidden layers needed for network training.
3. Number of nodes per layer: with each subsequent encoder layer the number of nodes per layer decreases, and it rises back up through the decoder; in terms of layer structure, the decoder is symmetrical to the encoder.
4. Loss function: represents the reconstruction error. Either mean squared error or binary cross-entropy is used, depending on the input values: if they lie within [0, 1], cross-entropy is typically used, otherwise mean squared error.
5. Training variables: settings associated with the training process, such as the activation functions, learning rate, number of training epochs, batch size per epoch, and how often to display information about training progress.

We configure the autoencoders using the Keras/TensorFlow Python package with GPU support and compile the neural network with mean squared error loss and the Adam optimizer [50] with a 0.001 learning rate and the default Keras parameters. We set the number of training epochs to 200 and the batch size for the training cycle to 512. The size of our input is set to 220 neurons, corresponding to the number of features in our dataset. We tried different hidden layer configurations, shown in Table 7, in order to minimize the reconstruction error computed as mean squared error; a sketch of one such configuration follows.
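A minimal Keras sketch of such a symmetric autoencoder, using the [180, 100, 20, 100, 180] hidden layer configuration from Table 7 (code size 20) and the training settings listed above; the linear output activation and the validation split are assumptions, not settings reported here:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for the standardized 100,000 x 220 feature matrix.
X = np.random.rand(5000, 220).astype("float32")

# Symmetric fully connected autoencoder: 220 -> 180 -> 100 -> 20 -> 100 -> 180 -> 220.
inputs = keras.Input(shape=(220,))
h = layers.Dense(180, activation="relu")(inputs)
h = layers.Dense(100, activation="relu")(h)
code = layers.Dense(20, activation="relu")(h)        # bottleneck: code size 20
h = layers.Dense(100, activation="relu")(code)
h = layers.Dense(180, activation="relu")(h)
outputs = layers.Dense(220, activation="linear")(h)  # linear output is an assumption

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)                  # encoder part, reused for clustering

# Training settings as described in the text: MSE loss, Adam with lr 0.001,
# 200 epochs, batch size 512.
autoencoder.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
autoencoder.fit(X, X, epochs=200, batch_size=512, validation_split=0.1, verbose=0)
```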
All encoder/decoder pairs used the rectified linear unit (ReLU) activation function, given by:
$$\begin{aligned} f(x)= & {} \max {(0,x)} \end{aligned}$$
(3)

Table 7 Experimental results for different hidden layer configurations

| Hidden layers | Training loss (MSE) | Validation loss (MSE) |
|---|---|---|
| [100, 50, 100] | 3.7788e−04 | 3.8346e−04 |
| [100, 30, 100] | 4.3803e−04 | 4.4357e−04 |
| [100, 20, 100] | 5.5041e−04 | 5.5685e−04 |
| [180, 100, 20, 100, 180] | 4.7450e−04 | 4.7450e−04 |
| [180, 100, 30, 100, 180] | 3.9926e−04 | 4.0332e−04 |
| [180, 100, 50, 100, 180] | 3.2366e−04 | 3.3278e−04 |

After training the autoencoder, reconstructing the original data by encoding and decoding samples from the validation set allows us to monitor training through the TensorBoard web interface, plotting the reconstruction error (MSE) or loss function as seen in Fig. 6.

Step 2: Clustering the encoded dataset. By training the autoencoder, its encoder part has learned to compress each sample into 20 floating-point values. We use the k-means clustering algorithm to perform the clustering: we first fit the encoded data to k-means and then determine the optimal number of clusters with the elbow method, measuring the sum of squared distances to the nearest cluster center, as shown in Fig. 7.
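Continuing the autoencoder sketch above (reusing `encoder` and `X` from that block), the encoded dataset can be clustered and the elbow curve computed as follows:

```python
from sklearn.cluster import KMeans

# Compress each sample to its 20-dimensional code and cluster the codes.
X_encoded = encoder.predict(X, batch_size=512)

# Elbow method on the encoded dataset: sum of squared distances to the nearest
# cluster center (inertia) for k = 1..15, as plotted in Fig. 7.
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_encoded).inertia_
       for k in range(1, 16)]
```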
Results and discussion
Clustering evaluation metrics
- External cluster validation: comparing the results of the cluster analysis to an externally known result, such as externally provided class labels. Since external validation measures assume the right cluster structure is known in advance, they are mainly used for choosing an optimal clustering algorithm on a certain dataset [52].
- Internal cluster validation: exploits the internal information of the clustering process to evaluate the performance of a clustering structure without depending on external information. It can also be used to estimate the appropriate number of clusters and the better clustering algorithm without any external data [52].
- Silhouette Coefficient: the silhouette analysis measures how well the objects are clustered by comparing the similarity of each object to its own cluster with its similarity to other clusters, thereby estimating the average separation between clusters. Silhouette values range between −1 and +1; a higher value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters, and vice versa. The silhouette can be calculated with different distance metrics, such as the Euclidean distance or the Manhattan distance, and is computed as:
$$\begin{aligned} S = \frac{1}{NC} \sum _{i} \frac{1}{n_i} \sum _{x \in C_i} \frac{b(x) - a(x)}{\max [b(x),a(x)]} \end{aligned}$$
(4)
where NC is the number of clusters, \(n_i\) is the number of objects in cluster \(C_i\), a(x) is the average distance from x to the other objects in its own cluster, and b(x) is the minimum average distance from x to the objects of another cluster. A scikit-learn sketch of this metric, together with the DB index below, follows the next item.
- DB index: the Davies–Bouldin index (DBI) is an internal evaluation metric that assesses clustering performance based on the average similarity of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. The lower the value of the DB index, the better the clustering. For k clusters it is calculated by the formula
$$\begin{aligned} DBI = \frac{1}{k} \sum _{i=1}^{k} \max _{i \ne j}\left( \frac{\sigma _i + \sigma _j}{d(C_i,C_j)} \right) \end{aligned}$$
(5)
where \(C_i\) is the centroid of cluster i, \(\sigma _i\) is the average distance of all points in cluster i to its centroid, and \(d(C_i,C_j)\) is the distance between centroids \(C_i\) and \(C_j\).
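A sketch of both metrics with scikit-learn, on a stand-in reduced dataset (PCA scores or autoencoder codes) and a stand-in k-means clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Stand-in for the reduced feature space (PCA scores or autoencoder codes).
rng = np.random.default_rng(0)
X_reduced = rng.standard_normal((1000, 20))

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)

sil = silhouette_score(X_reduced, labels, metric="euclidean")  # Eq. (4): higher is better
dbi = davies_bouldin_score(X_reduced, labels)                  # Eq. (5): lower is better
```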