Abstract

Currently, we are experiencing a rapid growth of the number of social-based online systems. The availability of the vast amounts of data gathered in those systems brings new challenges that we face when trying to analyse it. One of the intensively researched topics is the prediction of social connections between users. Although a lot of effort has been made to develop new prediction approaches, the existing methods are not comprehensively analysed. In this paper we investigate the correlation between network metrics and accuracy of different prediction methods. We selected six time-stamped real-world social networks and ten most widely used link prediction methods. The results of the experiments show that the performance of some methods has a strong correlation with certain network metrics. We managed to distinguish “prediction friendly” networks, for which most of the prediction methods give good performance, as well as “prediction unfriendly” networks, for which most of the methods result in high prediction error. Correlation analysis between network metrics and prediction accuracy of prediction methods may form the basis of a metalearning system where based on network characteristics it will be able to recommend the right prediction method for a given network.

1. Introduction

Network structures have been studied for many years. First research in this area can be traced back to 1736 when Euler defined and solved the Seven Bridges problem of Königsberg [1]. Since then, for a long time, networks have been mainly studied by mathematicians and this resulted in a very prominent research field known today as the graph theory. There was not much ground breaking development in the complex network research area until 1960s, when the Erdos-Renyi random graph model (ER-model) was introduced [2, 3]. This is the simplest model of complex network. Due to the fact that there was a lack of large real-world data, most of the work had been done on theoretical analysis of phenomena existing in networked structures (e.g., phase transition).

Over the years data collection techniques have significantly improved our ability to store massive and heterogenous network data. During the time when ER-model was introduced, progress has also been made by sociologists in researching real-world human relationships [4, 5]. A new wave of research was set off by Watts and Strogatz who published a paper about the small-world effect in 1998 [6] and introduction of the scale-free network model by Barabási and Albert one year later [7].

As the accessibility of database systems and Internet is growing, more and more real-world network datasets are available. The available information about people and their activities is much richer and more complex than ever before. The complex network concept is an abstract form of various real-world networks, for example, biological networks such as protein-protein interaction networks, metabolic networks [8, 9], human networks and disease spread [1013], scientific collaboration networks [14, 15], and online social networks [1619].

Link prediction in complex network is one of the popular research topics. Most of the researchers focus on the link prediction problem [20] which is very valuable for solving real-world problems. Generally, the prediction problem is mainly studied from two angles: (i) network structure and (ii) attributes of nodes and connections. Structure refers to the way in which nodes that compose the network are interconnected. It reflects the information about network topology. Majority of the progress in the area of structure based prediction has been made by mathematicians and physicists. Some of the well-known structure based prediction methods are Common Neighbour, Jaccard’s Index, Adamic/Adar Index, Katz, and so forth (for a review of the methods please see [21]).

The link prediction problem also has been studied from the angle of the network attribute information. The attribute information refers to description of the features of nodes. Such information is difficult to show directly in the network graph. It can be for example, done by labelling nodes; for example, 1 depicts node that represents woman and 2 means that node represents man. The majority of attribute-based prediction methods follow a machine learning approach; that is, they use classification-based methods to make predictions. Widely used methods include Decision Tree, Support Vector Machine (SVM), and Naïve Bayes [22, 23]. In [2426], authors report that the performance of link prediction improves when machine learning approaches are used. However, this is done using additional network information that is not always available. We would like to emphasize that, in our work, we are interested in the methods that only require the basic network structure information and thus we do not include machine learning methods in our study.

However, although much effort has been made, there is still no prominent prediction method that could provide a satisfactory performance. Thus, there is still a huge research gap that needs to be addressed.

1.1. Research Motivation

In the realm of network prediction, many efforts have been done on exploring new prediction methods that could provide better performance. However, the methods presented in most of the studies only improve the prediction result significantly for the network used in the study. There is a lack of systematic research that would enable to reveal the reason why the methods are good predictors when it comes to some of the networks but very bad when other networks are considered.

This paper addresses this problem, by exploring the correlation between network metrics and prediction accuracy of different methods. We expect that such approach will enable to find the reasons why methods performance varies on different networks. Apart from having a further understanding of the prediction methods, the study is also important as a theoretical base for developing new prediction methods. This could be relevant to many subjects. The prediction methods could help to find the relationships between proteins which might not be easily observed directly due to the interaction complexity. For example, new interactions can be inferred from the existing known interaction networks [27, 28] which shows a much better performance than prediction purely by chance. Online market targeting might also benefit from the network prediction which has already been applied in real-world industries. For example, Google and Amazon recommend customers the potential goods and services that they might be interested in which is a kind of link prediction that predicts the link between customers and products.

Beyond that, analysis of the link prediction problem in a time series approach could help researchers gain a better understanding of the evolution of the networks. Many works have been done to study the dynamics of complex network [2931]. The achievement of network prediction analysis could help explain the mechanism of the network evolution.

1.2. Contributions

The main contribution of our study is that we look at the link prediction as a time series problem and systematically analysed the correlation between network metrics and methods accuracy. In addition, in our experiments, we also find that for some networks, most of the prediction methods could provide a good performance while for some other networks, most methods are relatively powerless. We name them “prediction friendly” networks and “prediction unfriendly” networks, respectively.

The paper is structured as follows. Section 2 presents the prediction methods and performance metrics used in our experiments. Section 3 presents how the dataset were selected and processed. In Sections 4 and 5 we introduce the experimental design and present obtained results. We conclude the paper in Section 6.

Link prediction problem has been extensively studied by members of the complex network community. Liben-Nowell and Kleinberg have formalised the link prediction problem in [20] in the following way.

Let be a network within the time period of where represents the set of nodes and represents the set of links. For the next time period , the network might change. The link prediction focuses on how to predict the evolution of links, that is, how will differ from .

Researchers with background in physics and mathematics usually deal with the problem by focusing on the topology information of the networks. Researchers with machine learning and data mining background favour to solve the problem with considering the nodes’ attribute information. There are three types of link prediction problems as shown in Figure 1: we can consider (i) only adding links to the existing network, (ii) only removing links from the existing structure, and (iii) both, adding and removing links at the same time.

Adding Links. Adding links (Figure 1(a)) means that in the next time window a new link will be created between existing nodes. There can be one or more newly created links.

Removing Links. Removing links (Figure 1(b)) means that the link will disappear in the next time window. Similar to the situation when new links are added, one or more links can be removed in one time step.

Adding and Removing Links. This problem is the combination of two previously described problems. It means that from one time window to another both appearance and disappearance of links can be predicted (Figure 1(c)).

In this research, we will only focus on the first type of link prediction problem which only aims at predicting the appearance of links. The main reason for this is that the vast majority of existing methods for real-world data focus on this problem, so it means that we have big enough base to perform correlation analysis.

2.1. Prediction Methods

We select and present a brief description of ten commonly used prediction methods that use topology information about networks in the prediction process. Throughout this section the symbols , denote nodes, denotes number of nodes in the network, and is the average degree. and denote the neighbour sets of these nodes and and denote the degree number of node and , respectively.

Common Neighbours. This method is based on the assumption that two nodes with many common neighbours will be connected in the future. The more common neighbours the two users have, the higher the probability that a relationship between them will emerge. As a basic and intuitive method, Common Neighbours approach is usually used as a baseline to judge the performance of other methods [17, 20, 21, 40]. The complexity of this method, as introduced in [41], is .

Jaccard’s Coefficient. The Jaccard Coefficient, also known as Jaccard index or Jaccard similarity coefficient, is a statistic measure used for comparing similarity of sample sets. It is usually denoted as where and represent two different nodes in a network. In link prediction, all the neighbours of a node are treated as a set and the prediction is done by computing and ranking the similarity of the neighbour set of each node pair. This method is based on Common Neighbours method and its complexity is also . The mathematical expression of this method is as follows [20]:

Preferential Attachment. Due to the assumption that the node with high degree is more likely to get new links [42], preferential attachment was introduced as a prediction method. The degree of both nodes in a pair needs to be considered for the prediction. Same as common neighbours, this is also a basic prediction method which is usually used as a baseline to measure the performance of other prediction methods. This method will calculate similarity score for each pair of nodes within the network rather than only the neighbour of nodes; thus the complexity of preferential attachment is . This method can be expressed as

Adamic/Adar Index. It was initially designed to measure the relation between personal home pages. As shown in 4, the more friends has, the lower score it will be assigned to. Thus, the common neighbour of a pair of nodes with few neighbours contributes more to the Adamic/Adar score (AA) value than this with large number of relationships. In real-world social network, it can be interpreted as follows: if a common acquaintance of two people has more friends, then it is less likely that he will introduce the two people to each other than in the case when he has only few friends. It shows good results in predicting the friendship according to personal homepage and Wikipedia Collaboration Graph, but in the experiment of predicting author collaboration, it shows a poor accuracy prediction [16]. It is another method that is based on common neighbour; the complexity is also the . It is calculated as where is a common neighbour of node and node .

. This method takes lengths of all paths between each pair of nodes into consideration [43]. According to 5, the number of paths between node and node with length (written as ) is calculated and then multiplied by a factor . By summing up all the results for the given two nodes with path length from to , a prediction score for the pair of nodes is obtained. Katz is a prediction method based on the topology of whole network and thus its calculation is more complex than other methods in this section. The complexity is mainly determined by the matrix inversion operator, which is [41, 44]:

The parameter , as shown in 5, is used to adjust the weight of path with different length. When an extremely small is chosen, the longer paths will contribute less to the score in comparison to shorter ones so that the result will be close to the common neighbours.

It is one of the prediction methods that, as it will be shown in further sections, achieves high prediction accuracy in many experiments.

Cosine Similarity. The idea of this method is based on the dot product of two vectors. It is often used to compare documents in text mining [21]. In network prediction problem, this method is expressed as For each pair of nodes with common neighbours, this method will perform a vector multiplication and thus the complexity is .

Sørensen Index. This index [45] is designed for comparing the similarity of two samples and originally used in analysis plant sociology. The complexity of this method is . It is defined as

Hub Promoted Index. HPI is proposed for analysing metabolic networks as shown in [46]. The property of this index is that the links adjacent to hubs are likely to obtain a higher similarity score. The complexity of the method is . It is expressed as

Hub Depressed Index. Approach that uses the idea of hub in totally different manner than HPI is Hub Depressed Index (HDI). It gives links adjacent to hub a lower score. Its complexity is the same as Hub Promoted Index, . It is defined as

Leicht-Holme-Newman Index. LHNI [47] was proposed to quantify the similarity of nodes in networks. It is based on the concept that two nodes are similar if their immediate neighbours in the network are themselves similar. As another common neighbour based method, its complexity is . It is defined as

All of the methods presented in this section are following similar approach. The required input for each method is the adjacency matrix that represents a network in which there are only 0 and 1 (0 when there is no link between two given nodes and 1 when the links between two given nodes exist). The output of each method is a similarity matrix in which each element represents the similarity score of a pair of nodes within the network and it is calculated according to the equation used in a given method.

2.2. Prediction Performance Metrics

In order to measure the performance of a prediction method, we need to use historical network data. Link prediction is a time related activity; therefore, we should use time-stamped dataset, and according to the time stamp, separate the data into two sets, as training set for prediction methods and as unknown future network for testing where . Those two networks must consist of the same set of nodes . The number of possible links that is denoted by is . The link prediction method, in principle, provides a similarity score for each nonexisting link () and for most methods, a higher score means higher likelihood that the link will appear in the future. Final prediction is done by ordering this score list and selecting top links with the highest score.

In our work, AUC is used for quantifying the accuracy of prediction method. It is the area under the receiver operating characteristic curve [48]. In the context of network link prediction, AUC can be interpreted as the probability that a randomly chosen missing link () is given a higher similarity score than a randomly chosen pair of unconnected links () [49]. The algorithmic implementation of AUC follows the approach in [21]. It is calculated as where is the number of times that we randomly pick a pair of links from missing links set and unconnected links set; is the number of times that the missing link got a higher score than unconnected link, while is the number of times when they are equal. The AUC value will be 0.5 if the score is generated from an independent and identical distribution. Thus, the degree to which the AUC exceeds 0.5 indicates how much better the predictions are when compared to prediction by chance.

3. Data Preparation

All six datasets used in experiments are real-world social networks, five of them come from Koblenz Network Collection (KONECT [50]) and another one from the Wrocław University of Technology (see Table 1).

3.1. Dataset Selection

Datasets for the experiments have to meet certain requirements: (i) they have to represent data about users’ interactions or any other type of activity that enables to define connections between users and (ii) those activities have to be time stamped. As described in Section 2, the link prediction problem is a time series problem that looks into the evolution of networks in time. Time-stamp is thus necessary. Table 1 shows the original dataset information that was selected based on these two criteria.

3.2. Data Processing

To make the data suitable for the experiments, first the preprocessing of datasets has been performed. It consists of the following three steps.(1)Select Data Samples. For each dataset, we first randomly select 6000–8000 user records (8000 samples are selected due to the calculation capacity. As for some dense networks, 8000 nodes are also too big, so we choose 6000) from the original dataset as the sample user data. As UC Irvine Messages only contain 1899 users, so we leave them as they are. The specific sample numbers are shown in Table 2.(2)Split the Data into Training and Testing Sets. Prediction in a time series problem means the dataset should be divided into train and test sets based on time stamps available. As the dataset of Flickr and YouTube are collected by taking snapshot of the network which is different from other four datasets, we take the first day snapshot as the training set and the remaining data as the test set. The other four networks are split according to the time scale with a ratio of approximate training time: test time = 80% : 20% as shown in Table 2.(3)Extract Connected Network. Dividing data into training and testing sets can cause the isolation of some nodes or cliques. This, in turn, generates noise for measuring the accuracy of prediction methods as the methods we selected can not predict unconnected nodes. To eliminate the impact of this noise, we extract the giant component from training dataset as our final training set . The final test set is obtained by extracting the network with all the nodes that exist in from the original test set obtained from step (2). For nodes existing in the final training set but not present in the original test set, we just keep and leave them isolated in the final test set as it is formed by link disappearing.After all, we get the train set and test set as described in Section 2.2 where both sets have the same nodes .

4. Experimental Design

In order to be able to apply all selected methods and taking into account the types of datasets available, the network is represented as a binary unweighted network. This enables us to reach a consistent and comprehensive review of the existing methods.

First, the prediction methods described in Section 2.1 will be applied to each of the processed training sets to get the similarity matrix as the prediction result. The prediction results will be then evaluated using the testing set and the AUC for each method will be calculated.

For the implementation of those methods, we applied the toolbox that is presented in [21] and all the experiments were implemented in Matlab.

As stated before, the main goal of the research is to explore the correlations between the accuracy of different prediction methods and network metrics. For the training set of each network, the network metrics are calculated with toolboxes provided by KONECT [50] and MIT Strategic Engineering research group. The metrics we calculate include the following.

Global Clustering Coefficient. It is defined in [51] as It shows the transitivity of the network as a whole. The coefficient range is between 0 and 1.

Average Clustering Coefficient [6]. It is based on local clustering . For each vertex , its local clustering coefficient can be calculated by and then the ACC can be calculated as where is the number of nodes in a network.

Network Density. The ratio between number of existing links and number all possible links within a given network. where where is the number of nodes in the network.

Gini Coefficient [52]. I the network theory Gini coefficient is defined as where is the sorted list of degrees in the network and is the number of nodes in a network. Its value is between 0 and 1, where 0 denotes total equality between degrees and 1 denotes dominance of single node.

Diameter. It is the longest path out of the set of all shortest paths in the network. where is the shortest path between node and .

Average Shortest Path. The average number of the shortest paths between each pair of vertices is Once the accuracy of prediction for each method and the metrics for each network are calculated, the correlation between them will be analysed. The Pearson’s Coefficient [53] is used to measure the correlation between accuracy of network prediction method and selected network metrics. It is a widely used statistic method to measure linear correlation between two variables, say and . It is calculated as

The coefficient value is between and where means that two variables are negatively linearly correlated and means that they are positively linearly correlated.

5. Experiment Result

5.1. Network Profiles

The values of network metrics for each of the extracted social networks are presented in Table 5. As it is much easier to set up relationship between people in online social network than in real-world network, the average shortest paths in our experiments are all smaller than six, the number suggested by the six degrees of separation theory [54]. The average shortest path of the six selected networks is 3.65. This reflects the small-world property of the networks. People are closer to each other in online social networks than in face-to-face networks. This phenomenon was also pointed out in [55] where authors established that the average shortest path of Twitter is 3.43.

The degree distributions of the six networks, shown in Figure 3, indicate that they are scale-free networks as the distributions follow the power law.

We also compared the GCC and ASP metrics of the real network with the theoretical metrics of random network and regular network that have the same number of nodes and links. The analytical formulas for GCC and ASP in random and regular networks with a given number of nodes and links are given in Table 4. The results of calculations for each analysed network are presented in Table 3.

Figure 2 plots the metrics of six analysed networks and related theoretical networks, respectively. It shows that the clustering coefficients of the analysed networks are all between random and regular networks. Meanwhile, the average shortest paths of real-world networks are all very close to the random networks. This two phenomena indicate the small-world property of analysed structures. Taking into account both metrics and node degree distribution, it can be concluded that those networks are a combination of small-world and scale-free networks.

5.2. Prediction Results

The prediction results are summarised in Table 6. Katz method achieved the best average performance and the overall performance is ranked as Katz > Preferential Attachment > Adamic-Adar > Common Neighbours > Cosine Similarity > Jaccard Index > Hub Depressed Index > Hub Promoted Index > Sørensen > Leicht-Holme-Newman Index. By comparing the variance of each method, we find that the Katz also provides the most stable prediction performance among those methods while Common Neighbours is the worst performing approach. Overall, we find that Katz and preferential attachment provide good prediction accuracy together with a relative stability.

To study the prediction results from the perspective of each network please see Figure 4. The prediction results of different methods align on the vertical lines for each network, respectively. From this figure, we find that, for some networks, most of the prediction methods could provide a good prediction result. Such networks include Flickr, Enron, and YouTube. We call this type of networks the “prediction friendly” network. Apart from this type of network, there are also some networks for which most of the prediction approaches provide fairly low accuracy, such as Facebook, UC Irvine, and PWr. Similarly, we call those networks “prediction unfriendly” networks. Please note that, in the experiments, for both prediction friendly and unfriendly networks, always provides a good performance level.

5.3. Correlation between Prediction Accuracy and Network Metrics

Table 7 shows Pearson’s linear correlation coefficient of prediction accuracy and network metrics. The closer the absolute value to 1, the higher the correlation between analysed factors. Figure 5 presents a heat-map plot to show the degree of linear relation between the two factors where we use the absolute value of Pearson’s Coefficient. The brighter the colour in the heat-map is, the stronger a given network metric and the accuracy of prediction method are correlated.

In Figure 5, we can see that the preferential attachment and Gini Coefficient provide the highest correlation coefficient (0.94) which indicates that they generally follow a linear relationship. This is not a surprise. For a network with a high Gini Coefficient, there exist some nodes with dominant high degrees. It just reflects the phenomenon of “rich get richer” which is also the assumption of preferential attachment method. So we can say that preferential attachment could lead to a high Gini Coefficient and thus Preferential Attachment, on the other hand, could also describe how a network with high Gini Coefficient evolves by giving a better prediction result.

Cosine-GCC and Sor-GCC also provide a correlation coefficient above 0.8. We can draw the conclusion that Cosine Similarity and Sorensen Index method perform better in a network with higher GCC than they do in networks with smaller GCC.

The diameter and average shortest path shows a negative linear relation to almost all of the prediction methods (excluding Katz and LHN where the negative correlation is weak). Both the average shortest path and the network diameter reflect how easy it is to get from one node in a network to another one. Shorter path as well as smaller diameter means a higher probability that a pair of randomly picked nodes will be connected. Negative correlation between those two metrics and prediction accuracies of different methods means that most of the methods work well in the situations where networks feature short ASP and in consequence small Diameter. This is additionally supported by the fact that global clustering coefficient is positively correlated with those of the prediction methods meaning that these methods work well with networks with high clustering coefficient. Based on the above we can say that prediction methods positively correlated with GCC and negatively with ASP and diameter will work well in the situation where analysed network is of small-world type. In the same time they will work neither in random networks where GCC is very low nor in regular networks where ASP is very long.

It should be clear that the Pearson’s Coefficient does not indicate the accuracy of the method. For example, although the prediction method Katz does not show strong correlation to any of the network metrics, it still provides the best result in our experiments. The reason can be found in Table 6, where it is shown that Katz always provides a high prediction accuracy regardless of the tested network metrics.

The most important value of our correlation study lies in the variety of prediction methods used in the experiments. The prediction with methods combination could be a way to improve accuracy and this will be investigated in the future. The correlation between methods and network metrics could be used to determine the weight of different prediction methods in the combination process.

5.4. Prediction Friendly and Unfriendly Networks

Table 7 also shows the average correlation of network metrics and prediction accuracy. As we know the closer the absolute value of correlation to 1, the stronger the linear relation. Here we take 0.5 as a threshold for strong correlation. According to this, we find that there are four metrics strongly correlated with the prediction accuracy which includes GCC, ACC, Diameter, and ASP. So it is reasonable to assume that these metrics could be used to classify the prediction friendly and unfriendly networks. We ranked each of the analysed networks according to the metrics that have strong correlation with prediction accuracy and based on this for each network we calculate the average ranking (Table 8). Top three ranked networks (with the small average ranks) are the prediction friendly networks and the other three are prediction unfriendly networks. It can be seen that the prediction friendly networks usually have large global and local clustering coefficient, a short average shortest path, and small diameter. It suggests that networks with the structural profile similar to small-world network are easier to predict than networks similar to random structures.

6. Conclusions

In this research, we look into the correlation between ten prediction methods and different network metrics in six time-stamped social networks. The study of network metrics confirmed that the node degree distribution of real-world social networks follows a power law distribution. We also found that the average shortest path of online social network is much smaller than six. This might be due to the fact that online relationships are much easier to setup. The results of the prediction accuracy show that the best method among the tested ones is . It is also the most stable technique from all tested ones. preferential attachment is the second best method that also provides a good prediction accuracy. In addition, for some “prediction friendly” networks, most of prediction methods could provide a good performance while for some others, called in here as “prediction unfriendly” networks, most prediction methods are lack of power.

The Pearson correlation coefficient enabled us to investigate the relationship between network metrics and prediction accuracy. Our research showed that some methods are highly correlated with certain network metrics (e.g., PA-Gini, Sor-GCC, and Cosine-Gcc).

There are several further directions of the presented study. As discovered, for some networks, most prediction methods could provide a good performance which we name them as “prediction friendly networks.” Similarly, we also find the existence of “prediction unfriendly” networks. Section 5.4 explores the prediction friendly and unfriendly network classification according to the metrics ranking. The problem is that it does not provide an exact threshold that could be used to classify networks. It is out of scope of this research but is a very interesting topic for another study that we plan to conduct.

Based on the results of correlation between network metrics and the prediction accuracy, another possible work is to develop a new prediction approach which combines several existing methods. We can also extend this research to many other networks, not only social ones, which might be good for finding some more general relations.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.