2 Data Mining, the GUHA Method and LISp-Miner Software
Data mining refers to computer-based methods for finding, in large masses of data, relevant dependencies and features that are useful to the owner of the data. Typical data analyzed by data mining methods include, for example, measurements of an industrial process, extracts from a customer database or, as in this study, systematically reported data on road traffic accidents. In most cases, the computational algorithms used in data mining include various kinds of clustering methods, correlations, neural networks and self-organizing maps.
In the successful exploitation of data mining, the most essential ingredient is a comprehensive understanding of the data and its various quantities. Even a merely innovative approach to data visualization, for example, can help to see the benefits of the given data from a completely new perspective.
The GUHA method, introduced in 1966 (see [2]), is one of the oldest data mining methods. GUHA is an abbreviation of General Unary Hypotheses Automaton. As the name suggests, the aim of the GUHA method is not to test hypotheses on given data, but rather to produce them, under certain logical principles and user-defined guidelines, for possible further research. GUHA can be used to study large data matrices. In practice, with presently available software, the examined data matrix may have dozens or a few hundred columns and hundreds of thousands of rows, and a single cell in the data matrix may contain any symbol. There is no need to assume any statistical distribution on the data. The strength of the GUHA method is its logical basis: a formalism that expresses statements in first-order monadic logic with generalized quantifiers, with the truth and falsehood of sentences defined in finite models.
An essential part of the GUHA logic language is generalized quantifiers; they can be used to find answers to questions about the data containing expressions like ‘almost all,’ ‘most,’ ‘significantly different subset,’ ‘quasi-equivalent subsets,’ etc. So, for example, ‘Does the road accident data contain information on some special circumstances in which there have been exceptionally many fatal accidents?’ or ‘Is heavy vehicle involvement in an accident significantly dependent on some specific factors?’. With the GUHA method, it is possible to find all the dependencies that the data supports; thus, it allows the user to get answers to the question ‘Does my data contain some information it has not even occurred to me to look for?’.
The GUHA method was initially developed as a purely theoretical concept, but it has later been implemented in different kinds of computer software, of which the most extensive and significant is the LISp-Miner software, developed at the University of Economics, Prague since 1996 and freely downloadable [3]. The LISp-Miner software is based on number crunching; in theory, the software tests all the possible alternatives, which can number in the billions depending on the size of the data and the complexity of the task, but in practice, thanks to the logical criteria, only a few per mille of all options actually have to be computed. To our knowledge, however, general computational complexity issues of the GUHA method remain unsolved research problems.
The results generated by the LISp-Miner software are mostly local in nature, for example ‘Collision with a deer occurs to middle-aged men significantly more frequently in the early afternoon than before noon,’ rather than global, like ‘The proportion of run-off-road accidents decreases linearly as the driver’s age increases.’
Unlike the Bayesian reasoning approach [4] or neural network-based data mining [5], the GUHA method is not a black box method; the user must have some kind of preconception of the data being mined. The use of the LISp-Miner software requires pre-processing of the data prior to mining; for example, the data must be categorized: the user decides how small subdivisions of the data are still worth exploring, how variables are divided into categories, and what kinds of generalized quantifiers should be used. The LISp-Miner software has dozens of options, for example, to model expressions like ‘most,’ ‘often follows,’ ‘almost equivalent sets,’ ‘much larger than the average.’
A key part of the GUHA method is analytical questions. They are natural language expressions related to the data being studied. Examples of analytical questions related to the road traffic accident data are ‘How do accidents involving a young driver differ from other accidents?’ or ‘What is typical for accidents involving a heavy vehicle?’. Analytical questions can be formalized in the GUHA language and then analyzed with the LISp-Miner software. Formally, GUHA logic formulas are of the form \(\phi \approx \psi \), where \(\phi \) and \(\psi \) are logical combinations of variables of the data (corresponding to the columns of the data matrix) and \(\approx \) is a generalized quantifier. In practice, the LISp-Miner software generates fourfold contingency tables that are then examined in the light of the user’s criteria. A fourfold contingency table has the form:
|                 | \(\psi \) | \(\lnot \psi \) |             |
|-----------------|-----------|-----------------|-------------|
| \(\phi \)       | a         | b               | \(a+b = r\) |
| \(\lnot \phi \) | c         | d               | \(c+d = s\) |
|                 | \(a+c=k\) | \(b+d=l\)       | m           |
where \(a + b + c + d = m\), the number of rows in the data matrix, and
- a is the number of objects satisfying both \(\phi \) and \(\psi \),
- b is the number of objects satisfying \(\phi \) but not \(\psi \),
- c is the number of objects not satisfying \(\phi \) but satisfying \(\psi \),
- d is the number of objects satisfying neither \(\phi \) nor \(\psi \).
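As a small illustration (a sketch in Python, not part of the LISp-Miner implementation), the four frequencies can be computed directly from two Boolean columns of a data matrix:

```python
def fourfold(phi, psi):
    """Return the fourfold table frequencies (a, b, c, d) for two
    equally long Boolean columns phi and psi of a data matrix."""
    a = b = c = d = 0
    for p, q in zip(phi, psi):
        if p and q:
            a += 1          # satisfies both phi and psi
        elif p:
            b += 1          # satisfies phi but not psi
        elif q:
            c += 1          # satisfies psi but not phi
        else:
            d += 1          # satisfies neither
    return a, b, c, d

# Toy data matrix with m = 6 rows
phi = [True, True, True, False, False, False]
psi = [True, True, False, True, False, False]
print(fourfold(phi, psi))  # → (2, 1, 1, 2)
```

The marginals r, s, k, l and the total m follow immediately by summing the four frequencies.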
At present there are 21 quantifiers implemented in LISp-Miner [6]. In general, there are various types of generalized quantifiers in GUHA theory, formalizing various kinds of associations:
- Implicational quantifiers formalize the association ‘Many \(\phi \) are \(\psi \)’; they do not depend on the values c and d.
- Comparative quantifiers formalize the association ‘\(\phi \) makes \(\psi \) more likely than \(\lnot \phi \) does.’
- Some quantifiers just express observations on the data, and some others serve as tests of statistical hypotheses on unknown probabilities.
- Some quantifiers are symmetric: \(\phi \approx \psi \) implies \(\psi \approx \phi \), and some admit negation: \(\phi \approx \psi \) implies \(\lnot \phi \approx \lnot \psi \).
An advantage of GUHA is that new quantifiers can be defined and their properties analyzed in a well-established logical framework. The LISp-Miner software currently has six different computing procedures; we used three of them in this study: 4ft-Miner, SD4ft-Miner and Ac4ft-Miner. The 4ft-Miner procedure analyzes two formulas, the Antecedent (denoted by \(\phi \)) and the Succedent (denoted by \(\psi \)); as said, they are combinations of columns of the data matrix. If the given truth criteria are met, in logical terminology \(v(\phi (x) \approx \psi (x)) = \mathtt{TRUE}\), the software prints the result as a found hypothesis; if the criteria are not met, the formula is false and will not be printed. Thus, the truth \(\mathtt{TRUE}\) and falsehood \(\mathtt{FALSE}\) of a logic formula \(\phi \approx \psi \) is based on the values in the fourfold contingency table. For example, if \(\approx \) is a founded p-implication quantifier, where \(p \in (0,1]\), denoted by \(\Rightarrow _{p,{\hbox {Base}}}\), we define \(v(\phi (x) \Rightarrow _{p,{\hbox {Base}}}\psi (x)) = \mathtt{TRUE}\) if \(\frac{a}{a+b} \ge p\) and \(a \ge {\hbox {Base}}\), and \(v(\phi (x) \Rightarrow _{p,{\hbox {Base}}}\psi (x)) = \mathtt{FALSE}\) otherwise. If
$$\begin{aligned} v(\phi (x) \Rightarrow _{p,{\hbox {Base}}}\psi (x)) = \mathtt{TRUE}, \end{aligned}$$
then at least \(100p\,\%\) of the objects (rows in the data matrix) satisfying \(\phi \) also satisfy \(\psi \); in other words, founded p-implication quantifiers are related to the truth of the statement
$$\begin{aligned} \phi (x) {\hbox { implies }} \psi (x) {\hbox { with confidence }} p {\hbox { and support }} {\hbox {Base}}. \end{aligned}$$
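The founded p-implication criterion above can be sketched as follows (Python, for illustration only):

```python
def founded_implication(a, b, p, base):
    """v(phi =>_{p,Base} psi) = TRUE iff a/(a+b) >= p and a >= Base."""
    return (a + b) > 0 and a / (a + b) >= p and a >= base

# With p = 0.6 and Base = 100 (the settings used in the worked example
# later in the text):
print(founded_implication(102, 64, 0.6, 100))  # → True  (102/166 ≈ 0.615)
print(founded_implication(90, 60, 0.6, 100))   # → False (confidence met, but a < Base)
```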
Another example is Fisher quantifiers, corresponding to the test of the hypothesis
$$\begin{aligned} {\hbox {Probability}}(\phi (x) | \psi (x)) > {\hbox {Probability}}(\phi (x) | \lnot \psi (x)) {\hbox { with significance }} \alpha . \end{aligned}$$
A Fisher quantifier is defined to be \(\mathtt{TRUE}\) in the data (or supported in the data) if \(ad > bc\) and
$$\begin{aligned} \sum _{i=0}^{\min \{b,c\}}\frac{r!\,s!\,k!\,l!}{m!\,(a+i)!\,(b-i)!\,(c-i)!\,(d+i)!}\le \alpha , \end{aligned}$$
and \(\mathtt{FALSE}\) otherwise. Also useful are the Above average quantifiers \(\sim _{q}^{+}\), where \(q \ge 0\); if \(v(\phi \sim _{q}^{+} \psi ) = \mathtt{TRUE}\), then among the objects (rows in the data matrix) satisfying \(\phi \) there are at least \(100q\,\%\) more objects satisfying \(\psi \) than in the whole data matrix. In other words, the presence of \(\phi \) makes the presence of \(\psi \) at least \(1+q\) times more frequent. Obviously,
$$\begin{aligned} v(\phi \sim _{q}^{+} \psi ) = \mathtt{TRUE}{\hbox { iff }}\frac{a}{a+b} \ge (1+q)\frac{a+c}{a + b + c + d}. \end{aligned}$$
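Both criteria can be sketched directly from the formulas above (a Python illustration using standard-library factorials, not the LISp-Miner implementation):

```python
from math import factorial

def fisher_quantifier(a, b, c, d, alpha):
    """TRUE iff ad > bc and the Fisher exact-test tail sum is <= alpha."""
    if a * d <= b * c:
        return False
    m = a + b + c + d
    r, s, k, l = a + b, c + d, a + c, b + d
    tail = sum(
        factorial(r) * factorial(s) * factorial(k) * factorial(l)
        / (factorial(m) * factorial(a + i) * factorial(b - i)
           * factorial(c - i) * factorial(d + i))
        for i in range(min(b, c) + 1))
    return tail <= alpha

def above_average(a, b, c, d, q):
    """TRUE iff a/(a+b) >= (1+q) * (a+c)/m."""
    m = a + b + c + d
    return a / (a + b) >= (1 + q) * (a + c) / m

# A strongly diagonal table passes Fisher's test at the 5% level:
print(fisher_quantifier(8, 2, 2, 8, 0.05))  # → True (tail ≈ 0.0115)
print(above_average(30, 10, 20, 40, 0.4))   # → True (0.75 >= 1.4 * 0.5)
```

For real analyses the factorial sum would be computed in a numerically safer way, but the small example shows how the fourfold frequencies alone determine the truth value.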
These quantifiers and many others are implemented in 4ft-Miner, a procedure of LISp-Miner. More diverse and complex data mining tasks can be carried out with SD4ft-Miner and Ac4ft-Miner, the other procedures of LISp-Miner; these procedures analyze four formulas. The truth and falsehood of the associations corresponding to the related quantifiers are based on two contingency tables:
|                       | \(\psi _{1}\)          | \(\lnot \psi _{1}\)    |                         |
|-----------------------|------------------------|------------------------|-------------------------|
| \(\phi _{1}\)         | \(a_{1}\)              | \(b_{1}\)              | \(a_{1}+b_{1} = r_{1}\) |
| \(\lnot \phi _{1}\)   | \(c_{1}\)              | \(d_{1}\)              | \(c_{1}+d_{1} = s_{1}\) |
|                       | \(a_{1}+c_{1}=k_{1}\)  | \(b_{1}+d_{1}=l_{1}\)  | m                       |

|                       | \(\psi _{2}\)          | \(\lnot \psi _{2}\)    |                         |
|-----------------------|------------------------|------------------------|-------------------------|
| \(\phi _{2}\)         | \(a_{2}\)              | \(b_{2}\)              | \(a_{2}+b_{2} = r_{2}\) |
| \(\lnot \phi _{2}\)   | \(c_{2}\)              | \(d_{2}\)              | \(c_{2}+d_{2} = s_{2}\) |
|                       | \(a_{2}+c_{2}=k_{2}\)  | \(b_{2}+d_{2}=l_{2}\)  | m                       |
As an example, by the criterion
$$\begin{aligned} \left| \frac{a_{1}}{a_{1}+b_{1}} - \frac{a_{2}}{a_{2}+b_{2}}\right| \ge p, \quad p \in (0,1], \quad a_{1} \ge {\hbox {Base}}_{1} {\hbox { and }} a_{2} \ge {\hbox {Base}}_{2}, \end{aligned}$$
we define the truth of the statement that two subsets of the data matrix differ considerably from each other with respect to some property.
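This difference criterion can be sketched as follows (Python, for illustration; the names `base1` and `base2` stand for the two Base thresholds above):

```python
def sd_difference(a1, b1, a2, b2, p, base1, base2):
    """TRUE iff the two confidences differ by at least p and both
    supports a1, a2 reach their Base thresholds."""
    diff = abs(a1 / (a1 + b1) - a2 / (a2 + b2))
    return diff >= p and a1 >= base1 and a2 >= base2

# Confidences 0.5 vs 0.2 differ by 0.3 >= p = 0.25:
print(sd_difference(50, 50, 20, 80, 0.25, 10, 10))  # → True
```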
With the following two examples, we briefly introduce the use of the 4ft-Miner and Ac4ft-Miner procedures; more detailed information can be found, e.g., in [7]. The analytical question ‘Are some types of accidents particularly typical for female drivers?’ can be analyzed with the 4ft-Miner procedure through the following steps:
1. Set in the Antecedent part \(\phi \) all the variables (corresponding to the data matrix’s columns) whose impact on accidents is to be investigated. In addition, attach certain parameters, e.g., how many variables are to be included, whether they are associated with an ‘and’ or ‘or’ connective, etc.
2. Set as the Succedent \(\psi \) the variable ‘Female driver,’ which corresponds to one of the data columns, and attach the desired parameters.
3. Decide how small subsets are still considered relevant; in this case the Base is fixed to 100.
4. Choose the generalized quantifier; here it is founded p-implication with \(p = 0.6\). This means that if there is a sufficiently large accumulation of accidents (i.e., \(a \ge 100\)) in which at least 60% of the drivers are female, i.e., \(\frac{a}{a+b} \ge 0.6\), then LISp-Miner prints this result as a hypothesis.
Then the computation can be started. In this example, 4ft-Miner goes through more than 45 million fourfold contingency tables and produces four cases where the criteria are met. One of them is the following hypothesis: ‘The data includes 166 accidents on an icy road that occurred in the morning or daytime and involved a 36- to 55-year-old driver; 102 of the 166 drivers were women.’ Given that women were involved in about 25.5% of all of the over 80,000 road accidents, but in 61.5% of these cases, it may be assumed that even experienced female drivers are more vulnerable to serious accidents in daylight on frozen roads. In general, it is up to the user to evaluate whether the results are significant or not. Moreover, there is also a module in 4ft-Miner that afterward tests, in the Bayesian sense, the probability of the truth of a hypothesis produced by the software (see [8]).
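The figures in this hypothesis can be checked against the founded p-implication criterion of step 4 (a quick sanity check in Python; the counts are taken from the hypothesis above):

```python
a, total = 102, 166      # accidents with a female driver among the 166 matching rows
p, base = 0.6, 100       # quantifier settings chosen in steps 3 and 4

confidence = a / total
print(round(confidence, 3))            # → 0.614, i.e., about 61.5%
print(confidence >= p and a >= base)   # → True, so the result is printed as a hypothesis
```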
More complex than the 4ft-Miner procedure are the SD4ft-Miner procedure (SD is an abbreviation of set differs from set), which searches for distinct subdivisions of the data, and the Ac4ft-Miner procedure (Ac is an abbreviation of action), which searches for subdivisions that are similar with respect to some variables but clearly different with respect to some others. These data mining tools are based on a comparison of two fourfold contingency tables with respect to user-defined parameters. As an example, assume we want to investigate whether there are some types of accidents involving drivers of different ages whose frequencies differ by at least 25%. The analytical question would then be ‘Are there some types of accidents that are much more frequent in some age group than in some other age group?’. After setting the parameters and running Ac4ft-Miner, we obtain several hypotheses; one of the results is ‘For drivers under 25 years of age, there were far more run-off-road accidents in winter on certain types of roads than in comparable conditions for 45- to 64-year-old drivers.’ In these circumstances, a total of 3988 accidents occurred to young drivers, of which 2298 (58%) were run-off-road accidents, while the corresponding figures for older drivers were 4071 and 1226 (30%). Therefore, it can be said that in these circumstances, run-off-road accidents were more typical for younger than for older drivers.
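The counts reported for the two age groups can again be checked against the 25% difference criterion (a quick sanity check in Python; the counts come from the hypothesis above):

```python
young = 2298 / 3988   # share of the accident type among drivers under 25
older = 1226 / 4071   # the same share among 45- to 64-year-old drivers

print(round(young, 2), round(older, 2))   # → 0.58 0.3
print(abs(young - older) >= 0.25)         # → True: the groups differ by at least 25%
```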
6 Summary and Conclusion
In this study, we investigated the suitability of the GUHA data mining method and the LISp-Miner software for extracting relevant information from a big data matrix in general, and in particular from a \(100 \times 83{,}500\) data matrix related to road traffic accidents that occurred in Finland in 2004–2008. GUHA is a data mining method based on first-order monadic logic; its main concepts are generalized quantifiers and the analytical questions set by the user, which can be expressed in the GUHA language. The truth or falsehood of a single association or relation between two logic expressions in the data is verified by one or two contingency tables. The strength of the GUHA method is its robust logical and mathematical basis: new ideas and ways of extracting information from large data matrices can be defined in the GUHA language, their features explored, and ultimately implemented in software. GUHA has evolved over 50 years and is constantly evolving. However, a practical data miner does not need to master all of the theoretical basis of GUHA in depth. LISp-Miner is a software [3] implementing GUHA; following the GUHA principles, LISp-Miner goes through contingency tables and verifies or falsifies relations between logic expressions, which are Boolean in nature. More precisely, the patterns LISp-Miner deals with contain several derived Boolean attributes (from two in 4ft-Miner to five in SD4ft-Miner). Moreover, the procedures described in Section 14.9 in [6] also deal with categorical attributes not transformed into Boolean ones; however, we have not used these latter procedures in this study.
The road traffic accident data matrix contained more than 83,500 rows (each corresponding to a road traffic accident) and more than 100 columns (factors related to the accidents). A single PC would have needed over 300 days to compute all the required billions of contingency tables, but utilizing TUT’s grid system, which networks up to about 400 desktop computers on the TUT campus [9], the computation took only a fraction of that time. In total, the computation yielded more than 10,000 associations, called hypotheses in the GUHA language, of which 10 easily comprehensible ones are reported here.
The GUHA method produces hypotheses such as ‘The risk of colliding with a moose by car (a very typical road traffic accident type in Finland) in the Oulu region is triple the risk in the Häme region.’ However, the GUHA method does not explain the deeper connections behind the hypotheses; for example, there might simply be relatively more moose in the Oulu region than in the Häme region. All this depends on the investigator’s interpretation and may require further studies. Naturally, GUHA also produces trivial, well-known results. Some of these are included in this report, as it is intended to demonstrate that if there is any significant dependence in the data, the LISp-Miner software finds it.
As a result of the study, we conclude that the GUHA method and the LISp-Miner software can be used to extract from road traffic accident type data dependencies that cannot be found by other data mining methods. For example, in a study conducted at the University of Jyväskylä, no such detailed dependencies were found in the same data as this report presents [1]. On the other hand, this is understandable, since GUHA is a unique method and its findings cannot be directly compared to those of other data mining methods. Although the current interface of the LISp-Miner software is somewhat unpolished, it can be recommended for analyzing large data alongside other data mining algorithms, especially when it comes to studying dependencies between relatively small, but still significant, subsets of a big data repository.