Data fusion is applied to the heterogeneous data within the newly proposed overall framework for anomaly detection from heterogeneous data streams/sets. There are five types of data with three different modalities and different dimensionality, so the data cannot simply be combined and integrated. Therefore, we introduce a data fusion technique which first analyses the abnormality in each data type separately, determines a degree of suspicion between 0 and 1 for each, and then combines all the degrees of suspicion. There are four steps to transform the data. Firstly, estimate the eccentricity of the credit card data. Secondly, find the difference between the credit card and loyalty card data. Thirdly, determine the distance from the suspicious person’s car to the store where the credit or loyalty cards were used. Lastly, for images, use the accuracy of the gender and age classification as a degree of confidence. After that, all the degrees are summed and divided by their number to obtain a normalised level/degree of suspicion. The degree of suspicion based on the credit card data
\(\lambda _k^{\mathrm{cc}}\) is calculated based on the standardised eccentricity result. Eccentricity has been applied before to find an anomaly in credit card data. From eccentricity results, we calculate the degree of suspicion of the credit card data as follows:
$$\begin{aligned} \lambda _k^{\mathrm{cc}} =1-\frac{1}{\varepsilon _{k}} \end{aligned}$$
(6)
Here, k indexes the data points. If the value of
\(\lambda _k^{\mathrm{cc}}\) is more than 0.9615, then we consider the specific use of the credit card to be suspicious; otherwise, it is considered normal. The threshold is determined by the value of
n set when calculating
\(\varepsilon \). For these data, we use
\(\varepsilon =26\), which corresponds to
\(5\sigma \) according to the Chebyshev inequality (Mohd Ali et al. 2016). Thus,
\(\lambda _k^{\mathrm{cc}} =1-\frac{1}{26}=0.9615\) is taken as the threshold. The degree of suspicion is bounded as follows:
$$\begin{aligned} 0\le \lambda _k^{\mathrm{cc}} <1 \end{aligned}$$
(7)
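As a minimal sketch of Eqs. (6) and (7), the Python snippet below maps a standardised eccentricity value to a degree of suspicion and compares it with the threshold of 1 - 1/26. The names `cc_suspicion`, `is_cc_suspicious` and `CC_THRESHOLD` are illustrative only, and the sketch assumes that the standardised eccentricity has already been computed and is at least 1, so that the result stays within the range of Eq. (7).

```python
# Threshold from the text: an eccentricity of 26 (5 sigma) gives 1 - 1/26.
CC_THRESHOLD = 1.0 - 1.0 / 26.0


def cc_suspicion(eccentricity: float) -> float:
    """Degree of suspicion for one credit card record, following Eq. (6)."""
    if eccentricity < 1.0:
        # Assumption of this sketch: the standardised eccentricity is >= 1,
        # which keeps the result inside the range stated in Eq. (7).
        raise ValueError("standardised eccentricity is assumed to be >= 1")
    return 1.0 - 1.0 / eccentricity


def is_cc_suspicious(eccentricity: float) -> bool:
    """Flag a record whose degree of suspicion exceeds the threshold."""
    return cc_suspicion(eccentricity) > CC_THRESHOLD
```

For instance, an eccentricity of 30 yields a degree of about 0.967 and is flagged, whereas an eccentricity of 10 yields 0.9 and is not.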
Next, we calculate the disagreement between the credit card and loyalty card data,
\(\lambda _k^{\mathrm{dis}}\). The credit and loyalty card values are first matched based on the timeline, and the absolute difference between the matched values is then calculated. If the credit card and the loyalty card have the same value, then the value of
\(\delta \) is 0, which is considered as not suspicious:
$$\begin{aligned} \delta _k= & {} \parallel cc_k -lc_k \parallel \end{aligned}$$
(8)
$$\begin{aligned} \lambda _k^{\mathrm{dis}}= & {} 1-\frac{1}{1+\delta _k^2 } \end{aligned}$$
(9)
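Eqs. (8) and (9) can be sketched as below for the scalar case. This assumes that each credit card value has already been matched to its loyalty card counterpart along the timeline; the function name `disagreement` and its arguments are hypothetical.

```python
def disagreement(cc_value: float, lc_value: float) -> float:
    """Degree of suspicion from credit/loyalty card disagreement."""
    # Eq. (8): absolute difference between the matched values (scalar case).
    delta = abs(cc_value - lc_value)
    # Eq. (9): maps delta = 0 to 0 and approaches 1 as delta grows.
    return 1.0 - 1.0 / (1.0 + delta ** 2)
```

Identical values give a degree of 0, and a difference of one unit already gives 0.5, so the result depends on the units in which the values are expressed.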
After that, the degree of suspicion based on the distance between the person’s car and the location of the store,
\(\lambda _k^{\mathrm{loc}}\), is calculated. We begin by calculating the distance
d from every person’s car to every store location. Then,
\(\lambda _k^{\mathrm{loc}}\) is calculated as follows:
$$\begin{aligned} \lambda _k^{\mathrm{loc}} =e^{-\frac{d_k^2 }{2\sigma _k^2}} \end{aligned}$$
(10)
The value of
\(\sigma \) can be set based on the distance between a car park and the store location. We calculated the standard deviation of the distance from every store location to the car park over all trips for each day and took the average. Based on this, we determined
\(\sigma =555\) m. According to Van Der Waerden and Timmermans (2017), the normal distance between a car park and a store is between 50 and 700 m, so the value we determined (555 m) is well within these limits. If the distance
d between the person’s car and the store’s location is more than 555 m, then we consider the degree of suspicion to be high.
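The snippet below is a small sketch that evaluates the expression of Eq. (10) exactly as written, with the reported value of 555 m as the default. The names `loc_kernel` and `SIGMA_M` are illustrative, and the distance is assumed to be given in metres. Evaluated literally, the expression equals 1 at zero distance and decreases as the distance grows.

```python
import math

SIGMA_M = 555.0  # value of sigma reported in the text, in metres


def loc_kernel(distance_m: float, sigma_m: float = SIGMA_M) -> float:
    """Gaussian kernel of the car-to-store distance, as written in Eq. (10)."""
    if sigma_m <= 0.0:
        raise ValueError("sigma must be positive")
    return math.exp(-(distance_m ** 2) / (2.0 * sigma_m ** 2))
```

At a distance equal to sigma (555 m), the kernel evaluates to exp(-0.5), roughly 0.61.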
Data containing face images are used to identify the gender and the age of the person who used the card. We recognise the gender and age using a pre-trained deep CNN and an SVM classifier. We then use the classification accuracy to obtain the degree of suspicion based on the images, depending on whether the recognised gender or age matches the true one. If the age and the gender recognised from the images are the same as the true ones, then we calculate the degree of suspicion as below:
$$\begin{aligned} \lambda _k^{\mathrm{gen}} \hbox { or } \lambda _k^{\mathrm{age}} = 1 - \hbox {classification accuracy} \end{aligned}$$
(11)
This accommodates the uncertainty due to the classifier error. If the gender or age recognised from the images differs from the true one, then we take the classification accuracy itself as the degree of suspicion, for the same reason:
$$\begin{aligned} \lambda _k^{\mathrm{gen}} \hbox { or } \lambda _k^{\mathrm{age}} = \hbox {classification accuracy} \end{aligned}$$
(12)
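The two cases of Eqs. (11) and (12) can be sketched as a single function. The function name `image_suspicion` and the use of string labels are assumptions made for illustration; the classification accuracy is assumed to be a fraction between 0 and 1 obtained beforehand from the classifier.

```python
def image_suspicion(recognised: str, true: str, accuracy: float) -> float:
    """Degree of suspicion for gender or age recognised from an image."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie between 0 and 1")
    if recognised == true:
        # Eq. (11): a match leaves only the classifier's error rate.
        return 1.0 - accuracy
    # Eq. (12): a mismatch is as credible as the classifier is accurate.
    return accuracy
```

With an accuracy of 0.9, a matching label gives a degree of about 0.1 and a mismatching label gives 0.9.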
The final step is the fusion of all partial degrees of suspicion into a total degree of suspicion,
\(\lambda _k^{\mathrm{total}}\). We multiply each value
\(\lambda _k^i\) by its weight and sum the results. The weights can be set by experts based on the importance of the specific type of data. By default, all weights can be set to 1:
$$\begin{aligned} w_i =1, \quad \forall i \end{aligned}$$
Then, the weighted sum is divided by the sum of the weights, which equals the number of types of data,
N, when all weights are 1, to normalise it.
\(\lambda _k^{\mathrm{total}}\) is defined as follows:
$$\begin{aligned} \lambda _k^{\mathrm{total}}= & {} \frac{\sum _{i=1}^N w_i \lambda _k^i}{\sum _{i=1}^N w_i} \end{aligned}$$
(13)
$$\begin{aligned} \lambda _k^{\mathrm{total}}= & {} \frac{w^{\mathrm{cc}}\lambda _k^{\mathrm{cc}} +w^{\mathrm{dis}}\lambda _k^{\mathrm{dis}} +w^{\mathrm{loc}}\lambda _k^{\mathrm{loc}} +w^{\mathrm{gen}}\lambda _k^{\mathrm{gen}} +w^{\mathrm{age}}\lambda _k^{\mathrm{age}}}{\sum _{i=1}^N w_i} \end{aligned}$$
(14)
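A minimal sketch of the weighted average in Eqs. (13) and (14) is given below. The function name `fuse` and the dictionary keys are hypothetical; the partial degrees are assumed to have been computed already and to lie between 0 and 1, and all weights default to 1 when none are supplied.

```python
from typing import Mapping, Optional


def fuse(partials: Mapping[str, float],
         weights: Optional[Mapping[str, float]] = None) -> float:
    """Weighted average of the partial degrees of suspicion for one case."""
    if weights is None:
        # Default from the text: every weight equals 1.
        weights = {name: 1.0 for name in partials}
    total_weight = sum(weights[name] for name in partials)
    if total_weight <= 0.0:
        raise ValueError("the weights must sum to a positive value")
    weighted_sum = sum(weights[name] * value
                       for name, value in partials.items())
    return weighted_sum / total_weight


# Illustrative values only, not taken from the dataset.
example = {"cc": 0.9, "dis": 0.5, "loc": 0.1, "gen": 0.1, "age": 0.4}
total = fuse(example)
```

With unit weights, the example reduces to the plain mean of the five values (about 0.4). Doubling the weight of `cc` would raise the result, since the credit card degree is the largest term.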
The data fusion technique we used is a weighted average. The main novelty is that we consider heterogeneous data, where each variable represents a different type of data. All the data are first transformed into values between 0 and 1 to make them comparable. Then, we rank the cases in descending order of
\(\lambda _k^{\mathrm{total}}\). The highest value corresponds to the most suspicious case, and the lowest may be the least suspicious one. A human expert can determine whether a case is suspicious or non-suspicious by selecting a threshold value. If a line needs to be drawn between suspicious and non-suspicious cases, then the 3-sigma rule or the Chebyshev inequality may be applied to the result. We do not intend to build a fully automatic system that resolves investigations (hardly anyone would allow or be happy with one). The aim is rather to simplify and facilitate the role of the humans, so that they can focus on a small number of cases to be looked at in more detail. The proposed approach simplifies the processing of such a huge amount of data and can assist human experts in their investigation and in making the final decisions.
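The ranking step can be sketched as follows, assuming the total degree of suspicion has been computed for every case and stored under a case identifier. The function name `rank_cases`, the parameter `top_n` and the identifiers are hypothetical; the cut-off is left to the human expert rather than fixed in code.

```python
from typing import Dict, List, Optional, Tuple


def rank_cases(totals: Dict[str, float],
               top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    """Sort cases by total degree of suspicion, most suspicious first."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    # Optionally keep only the first few cases for closer inspection.
    return ranked if top_n is None else ranked[:top_n]
```

An expert could then inspect, for example, only the first few entries of the ranked list.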