Skip to main content

2010 | Buch

Mining of Data with Complex Structures

verfasst von: Fedja Hadzic, Henry Tan, Tharam S. Dillon

Verlag: Springer Berlin Heidelberg

Buchreihe : Studies in Computational Intelligence

insite
SUCHEN

Über dieses Buch

Mining of Data with Complex Structures:

- Clarifies the type and nature of data with complex structure including sequences, trees and graphs

- Provides a detailed background of the state-of-the-art of sequence mining, tree mining and graph mining.

- Defines the essential aspects of the tree mining problem: subtree types, support definitions, constraints.

- Outlines the implementation issues one needs to consider when developing tree mining algorithms (enumeration strategies, data structures, etc.)

- Details the Tree Model Guided (TMG) approach for tree mining and provides the mathematical model for the worst case estimate of complexity of mining ordered induced and embedded subtrees.

- Explains the mechanism of the TMG framework for mining ordered/unordered induced/embedded and distance-constrained embedded subtrees.

- Provides a detailed comparison of the different tree mining approaches highlighting the characteristics and benefits of each approach.

- Overviews the implications and potential applications of tree mining in general knowledge management related tasks, and uses Web, health and bioinformatics related applications as case studies.

- Details the extension of the TMG framework for sequence mining

- Provides an overview of the future research direction with respect to technical extensions and application areas

The primary audience is 3rd year, 4th year undergraduate students, Masters and PhD students and academics. The book can be used for both teaching and research. The secondary audiences are practitioners in industry, business, commerce, government and consortiums, alliances and partnerships to learn how to introduce and efficiently make use of the techniques for mining of data with complex structures into their applications. The scope of the book is both theoretical and practical and as such it will reach a broad market both within academia and industry. In addition, its subject matter is a rapidly emerging field that is critical for efficient analysis of knowledge stored in various domains.

Inhaltsverzeichnis

Frontmatter
Introduction
Abstract
Large amounts of data are collected and stored by different government, industrial, commercial or scientific organizations. As the complexity and volume of the data continue to increase, the task of classifying new unseen data and extracting useful knowledge from the data is becoming practically impossible for humans to do. This makes the automatic knowledge acquisition process not just advantageous over manual knowledge acquisition, but rather a necessity since the probability that a human observer will detect something new and useful is very low given the overwhelming complexity and volume of the information.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Tree Mining Problem
Abstract
As mentioned in Chapter 1, an important category of complex data is tree-structured data. It occurs in a variety of different domains and applications such as Web Intelligence applications, bioinformatics, natural language processing, programming compilation, scientific knowledge management and querying, etc. (Wang et al. 1994). Mining of tree-structured data introduces significant new challenges which are the subject of this chapter.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Algorithm Development Issues
Abstract
In this chapter we discuss the main issues that arise when developing tree mining algorithms. The way that an algorithm is implemented often greatly determines its efficiency. The main aspects which affect the overall performance of the algorithm are the way that the document structure is represented at the algorithm level, the way that candidate subtrees are enumerated and counted and, in the case of unordered subtree mining, the canonical form ordering schemes.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Tree Model Guided Framework
Abstract
In this chapter, we describe the main characteristics of the Tree Model Guided (TMG) Framework for frequent subtree mining. This framework has good extendibility to all of the current problems for frequent subtree mining (Hadzic 2008; Tan 2008). An algorithm is considered as extendible in the sense that minimal effort is required to adjust the general framework so that different but related problems can be solved. Furthermore, the results presented in works such as (Tan et al. 2005; 2006a, 2008, Hadzic et al. 2007, 2010) indicate that it currently exhibits the best or comparable performance among the current state-of-the-art methods. The TMG framework is also conceptually simple to understand, especially with respect to the small adjustments required to address different sub-problems within the tree mining field. The remainder of the algorithm development issues are addressed in such a way as to accommodate the most efficient execution of the TMG candidate generation. Hence, as mentioned in the previous chapter, the important aspects that need to be taken into account in addition to the candidate enumeration strategy are: tree representation, representative data structures and their operational use, and the frequency counting of generated candidate subtrees. As mentioned in Chapter 3, in the tree mining field a string-like representation is the most popular representation because each item in the string can be accessed in O(1) time, it is space efficient and easy to manipulate. In our framework, we utilize the depth-first or pre-order string encoding as described in Chapter 3. The problem of candidate subtree enumeration is to efficiently extract a complete and non-redundant set of subtrees from a given document tree. We explain the TMG approach to candidate subtree enumeration in Section 4.2. As the name implies, the enumeration phase is guided by the tree model of the document in order to generate only valid candidate subtrees. This tree model corresponds to the underlying structure of the document and a subtree is considered valid by conforming to it.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
TMG Framework for Mining Ordered Subtrees
Abstract
In this chapter, we will elaborate on the overall TMG framework for mining ordered subtrees as described in Chapter 4 (Tan 2008). In an ordered tree, for each internal node, the order of its children is fixed. Such trees have found many useful applications in areas such as vision, natural language processing, molecular biology, programming compilation etc. (Wang, Zhang, Jeong & Shasha 1994). In the research on automatic natural language processing, the dictionary definitions are represented syntactically as trees. Computational linguists extract semantic information about these definitions and in the process construct semantic taxonomies (Chodorow & Klavans, 1990; Neff, Roy & Omneya 1998). In the molecular biology field, large amounts of analyzed RNA structures are collected and stored in the form of ordered labeled trees. When the researchers want to acquire information about a new RNA structure, it is compared against those already in the database in order to detect structural similarities and thereby relate different RNA structures (Shapiro & Zhang 1990). Since the researchers will maintain the RNA-related information in the same order, a comparison of ordered subtrees is sufficient. A general observation about applications where ordered subtree mining is suitable is that the left-to-right order among sibling nodes is commonly fixed and known beforehand. Ordered subtree mining is useful for querying a single database where the order restriction can be placed on the query subtree because it is known beforehand.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
TMG Framework for Mining Unordered Subtrees
Abstract
This chapter describes the extension of the TMG framework for the mining of unordered induced/embedded subtrees. While in online tree-structured documents such as XML the information is presented in a particular order, in many applications the order among the sibling-nodes is considered unimportant or irrelevant to the task and is often not available. If one is interested in comparing different document structures, or the document is composed of data from several heterogeneous sources, it is very common for the order of sibling nodes to differ, although the information contained in the structure is essentially the same. In these cases, mining of unordered subtrees is much more suitable as a user can pose queries and does not have to worry about the order. All matching sub-structures will be returned with the difference being that the order of sibling nodes is not used as an additional candidate grouping criterion. Hence, the main difference when it comes to the mining of unordered subtrees is that the order of sibling nodes of a subtree can be exchanged and the resulting tree is still considered the same.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Mining Distance-Constrained Embedded Subtrees
Abstract
For certain applications, the distance between the nodes in a hierarchical structure could be considered important and two embedded subtrees with different distance relationships among the nodes need to be considered as separate entities. The embedded subtrees extracted using the traditional definition are incapable of being further distinguished based upon the node distance within that subtree. In this chapter, we describe the extension of the general TMG framework, to enable the mining of distance-constrained embedded subtrees, (Hadzic 2008; Tan 2008). In such subtrees, the distances of the nodes relative to the root of the subtree need to be taken into account during the candidate enumeration phase. The distances of nodes relative to the root (node depth) of a particular subtree will need to be stored and used as an additional equality criterion for grouping the enumerated candidate subtrees. In Chapter 9, we will illustrate scenarios and applications where the mining of distance-constrained embedded subtrees would be preferable to mining of traditional embedded subtrees, since the extracted subtree patterns will be more informative. We also highlight the importance of distance-constrained subtree mining in the context of web log mining, where the web logs are represented in tree-structured form. In what follows, we will discuss the importance of distance-constrained embedded subtrees from a more general perspective and relate it to some previous work on extracting tree-structured queries.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Mining Maximal and Closed Frequent Subtrees
Abstract
In general, for frequent pattern mining problems, the candidate enumeration process exhaustively enumerates all possible combinations of itemsets that are a subset of a given database. This process is known to be very expensive since, in many circumstances, the number of candidates to enumerate is quite large, and also the frequent patterns present in real-world data can be fairly long (Bayardo 1998). Efficient techniques attacking different issues and problems relevant to the enumeration problem are therefore highly sought after. In addition to the enumeration problem, another important problem of frequent pattern mining is to efficiently count and prune away any itemsets discovered to be infrequent. Due to the large number of candidates that can be generated from the vast amount of data, an efficient and scalable counting approach is critically important. Another problem when extracting all frequent subtrees from a complex tree database, is that the number of patterns presented to the user can be very large, thereby making the results hard to analyze and gain insights from.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Tree Mining Applications
Abstract
The aim of this chapter is to discuss the applications of tree mining algorithms with respect to the general knowledge analysis task, as well as some specific applications. The implications of using different tree mining parameters (i.e. subtree types, support definitions, constraints) are discussed, and illustrative scenarios are used to indicate useful application areas for different parameters. This kind of overview will be particularly useful for those not so familiar with the area of tree mining as it can reveal useful applications within their domain of interest. It gives guidance as to which type of tree mining will be most useful for their particular application.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Extension of TMG Framework for Mining Frequent Subsequences
Abstract
Data mining strives to find frequent patterns in data. The type of data, structures and patterns to be found characterizes a task in data mining. Advancements in data collection technologies have contributed to the high volumes of sales data. Sales data typically consists of transaction timestamps and the list of items bought by customers. If one is interested in inter-transactional patterns, sales data is a good example of a sequential pattern. In addition, sequential patterns exist in data such as DNA string databases, time series, occurrences of recurrent illness, etc.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Graph Mining
Abstract
The contents of the book have focused so far on the mining of data where the underlying structure is characterized by special types of graphs where cycles are not allowed, i.e. acyclic graphs or trees. The focus of this chapter is on the frequent pattern mining problem where the underlying structure of the data can be of general graph type where cycles are allowed. These kinds of representations allow one to model complex aspects of the domain such as chemical compounds, networks, the Web, bioinformatics, etc. Generally speaking, graphs have many undesirable theoretical properties with respect to algorithmic complexity. In the graph mining problem, the common requirement is the systematic enumeration of sub-graphs from a given graph, known as the frequent subgraph mining problem. From the available graph analysis methods, we will narrow our focus to this problem as it is the prerequisite for the detection of interesting associations among graph-structured data objects, and has many important applications. For an extensive overview of graph mining in a general context, including different laws, data generators and algorithms, please refer to (Chakrabati & Faloutsos 2006; Washio & Motoda 2003, Han & Kamber 2006). Due to the existence of cycles in a graph, the frequent subgraph mining problem is much more complex than the frequent subtree mining problem. Even though theoretically it is an NP complete problem, in practice, a number of approaches are very applicable to the analysis of real-world graph data. We will look at a number of different approaches to the frequent subgraph mining problem and a number of approaches for the analysis of graph data in general.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
New Research Directions
Abstract
This chapter will discuss some new research directions in the frequent subtree mining field. This will be discussed from both the application and technical perspectives. Since frequent subtree mining (FSM) is a relatively new field compared with frequent itemset/sequence mining, many lessons can be learned form the more mature research in frequent itemset/sequence mining. A drawback of frequent pattern mining in general is that often, for a set support threshold, the number of frequent patterns becomes quite large due to some characteristics of the database. This may cause not only algorithm complexity problems, but also significant delays in the analysis and interpretation of the results. Many of the patterns may not be useful for the application at hand and/or are redundant, or not of interest to the user. Furthermore, it is also not always clear what support threshold is satisfactory for obtaining reasonable results. These are all important research areas, with some significant achievements in complexity reduction from the algorithmic and application perspectives. Some of these or similar ideas can, to a certain extent, already be applied in the FSM field, but others will need refinements and extensions to be flexible enough to cope with the additional structural properties of the data. In Section 12.2, we highlight some of the work in frequent itemset/sequence mining where the same or similar idea can be applied and prove useful in the FSM field. At the end of Section 12.2, we look at some work that has already been initiated in frequent pattern filtering and the incorporation of application-oriented constraints.
Fedja Hadzic, Henry Tan, Tharam S. Dillon
Metadaten
Titel
Mining of Data with Complex Structures
verfasst von
Fedja Hadzic
Henry Tan
Tharam S. Dillon
Copyright-Jahr
2010
Verlag
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-17557-2
Print ISBN
978-3-642-17556-5
DOI
https://doi.org/10.1007/978-3-642-17557-2