
2014 | Book

Data Mining and Knowledge Discovery for Big Data

Methodologies, Challenge and Opportunities

Edited by: Wesley W. Chu

Publisher: Springer Berlin Heidelberg

Book series: Studies in Big Data


About this book

The field of data mining has made significant and far-reaching advances over the past three decades. Because of its potential power for solving complex problems, data mining has been successfully applied to diverse areas such as business, engineering, social media, and biological science. Many of these applications search for patterns in complex structural information. In biomedicine, for example, modeling complex biological systems requires linking knowledge across many levels of science, from genes to disease. Further, the data characteristics of these problems have shifted from static to dynamic and spatiotemporal, from complete to incomplete, and from centralized to distributed, while growing in scope and size (this is known as big data). The effective integration of big data for decision-making also requires privacy preservation.

The contributions to this monograph summarize the advances of data mining in the respective fields. This volume consists of nine chapters that address subjects ranging from opinion mining, spatiotemporal databases, discriminative subgraph patterns, path knowledge discovery, social media, and privacy issues to computation reduction via binary matrix factorization.

Table of contents

Frontmatter
Aspect and Entity Extraction for Opinion Mining
Abstract
Opinion mining, or sentiment analysis, is the computational study of people’s opinions, appraisals, attitudes, and emotions toward entities such as products, services, organizations, individuals, events, and their different aspects. It has been an active research area in natural language processing and Web mining in recent years. Researchers have studied opinion mining at the document, sentence, and aspect levels. Aspect-level analysis (called aspect-based opinion mining) is often desired in practical applications because it provides detailed opinions or sentiments about different aspects of entities and about the entities themselves, which are usually required for action. Aspect extraction and entity extraction are thus two core tasks of aspect-based opinion mining. In this chapter, we provide a broad overview of the tasks and the current state-of-the-art extraction techniques.
Lei Zhang, Bing Liu
Mining Periodicity from Dynamic and Incomplete Spatiotemporal Data
Abstract
As spatiotemporal data becomes widely available, mining and understanding such data have gained a lot of attention recently. Among all important patterns, periodicity is arguably the most frequently occurring one for moving objects. Finding periodic behaviors is essential to understanding the activities of objects, predicting future movements, and detecting anomalies in trajectories. However, periodic behaviors in spatiotemporal data can be complicated, involving multiple interleaving periods, partial time spans, and spatiotemporal noise and outliers. Even worse, due to the limitations of positioning technology or its various kinds of deployments, real movement data is often highly incomplete and sparse. In this chapter, we discuss existing techniques to mine periodic behaviors from spatiotemporal data, with a focus on tackling the aforementioned difficulties that arise in real applications. In particular, we first review the traditional time-series method for periodicity detection. Then, a novel method specifically designed to mine periodic behaviors in spatiotemporal data, Periodica, is introduced. Periodica proposes to use reference spots to observe movement and detect periodicity from the in-and-out binary sequence. Then, we discuss the important issue of dealing with sparse and incomplete observations in spatiotemporal data, and propose a new general framework, Periodo, to detect periodicity for temporal events despite such nuisances. We provide experimental results on real movement data to verify the effectiveness of the proposed methods. While these techniques are developed in the context of spatiotemporal data mining, we believe that they are very general and could benefit researchers and practitioners from other related fields.
Zhenhui Li, Jiawei Han
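As a rough illustration of the in-and-out binary sequence mentioned in the abstract above, the following Python sketch scores candidate periods by how often an "inside the reference spot" observation recurs after a fixed shift. This is not the Periodica algorithm, whose details are not given on this page; the function names, the recurrence score, and the toy sequence are all assumptions made for illustration.

```python
def recurrence_scores(in_out, max_period):
    """Score each candidate period p in 1..max_period.

    in_out is a list of 0/1 flags: 1 means the object was inside the
    reference spot at that time step (hypothetical input format).
    The score for p is the fraction of 1s that are followed by another 1
    exactly p steps later, a simple autocorrelation-style measure.
    """
    n = len(in_out)
    scores = {}
    for p in range(1, max_period + 1):
        # Count time steps where the object is inside now and again p steps later.
        overlap = sum(in_out[t] * in_out[t + p] for t in range(n - p))
        # Number of 1s that have a partner p steps ahead within the sequence.
        ones = sum(in_out[: n - p])
        scores[p] = overlap / ones if ones else 0.0
    return scores


def best_period(in_out, max_period):
    """Return the highest-scoring period; ties go to the shortest period."""
    scores = recurrence_scores(in_out, max_period)
    return max(scores, key=lambda p: (scores[p], -p))


# Toy data (assumed): the object visits the spot once every 4 time steps.
visits = [1, 0, 0, 0] * 6
period = best_period(visits, max_period=10)
```

On real movement data the sequence would be noisy and incomplete, which is the difficulty the chapter addresses; this sketch assumes a clean, fully observed sequence.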
Spatio-temporal Data Mining for Climate Data: Advances, Challenges, and Opportunities
Abstract
Our planet is experiencing simultaneous changes in global population, urbanization, and climate. These changes, along with the rapid growth of climate data and increasing popularity of data mining techniques, may lead to the conclusion that the time is ripe for data mining to spur major innovations in climate science. However, climate data bring forth unique challenges that are unfamiliar to the traditional data mining literature, and unless they are addressed, data mining will not have the same powerful impact that it has had on fields such as biology or e-commerce. In this chapter, we refer to spatio-temporal data mining (STDM) as a collection of methods that mine the data’s spatio-temporal context to increase an algorithm’s accuracy, scalability, or interpretability (relative to non-space-time aware algorithms). We highlight some of the singular characteristics and challenges STDM faces within climate data and their applications, and provide the reader with an overview of the advances in STDM and related climate applications. We also demonstrate some of the concepts introduced in the chapter’s earlier sections with a real-world STDM pattern mining application to identify mesoscale ocean eddies from satellite data. The case study provides the reader with concrete examples of challenges faced when mining climate data and how effectively analyzing the data’s spatio-temporal context may improve existing methods’ accuracy, interpretability, and scalability. We end the chapter with a discussion of notable opportunities for STDM research within climate.
James H. Faghmous, Vipin Kumar
Mining Discriminative Subgraph Patterns from Structural Data
Abstract
Many scientific applications search for patterns in complex structural information; when this structural information is represented as graphs, a powerful tool is efficiently mining discriminative subgraphs. For example, the structures of chemical compounds can be stored as graphs, and with the help of discriminative subgraphs, chemists can predict which compounds are potentially toxic; 3D protein structures can be stored as graphs, and with the help of discriminative subgraphs, pharmacologists can predict which proteins are able to bind certain ligands and which are not; program flow information can be represented as graphs and with the help of discriminative subgraphs, computer scientists can identify program bugs and predict which program flows are successful and which are not. Many research studies have been devoted to developing efficient discriminative subgraph pattern mining algorithms. Higher efficiency allows users to process larger graph datasets and higher effectiveness enables users to achieve better results in applications. In this chapter, we introduce several existing discriminative subgraph pattern mining algorithms, including LEAP, CORK, graphSig, COM, GAIA and LTS. We evaluate the algorithms with real protein and chemical structure data.
Ning Jin, Wei Wang
Path Knowledge Discovery: Multilevel Text Mining as a Methodology for Phenomics
Abstract
Transdisciplinary research is a rapidly expanding part of science and engineering, demanding new methods for connecting results across fields. In biomedicine, for example, modeling complex biological systems requires linking knowledge across multiple levels of science, from genes to disease. The move to multilevel research requires new strategies; in this discussion we present path knowledge discovery, a novel methodology for linking published research findings.
The development of path knowledge discovery was motivated by problems in neuropsychiatry, where researchers need to discover interrelationships extending across brain biology that link genotype (such as dopamine gene mutations) to phenotype (observable characteristics of organisms such as cognitive performance measures). To advance an understanding of the complex bases of neuropsychiatric diseases, researchers need to search and discover relations among the many manifestations of these diseases across multiple biological and behavioral levels (i.e., genotypes and phenotypes at levels from molecular expression through complex syndromes). Phenomics – the study of phenotypes on a genome-wide scale – requires close collaboration among specialists in multiple fields. We developed a computer-aided path knowledge discovery methodology to accomplish this goal.
Path knowledge discovery consists of two integral tasks: 1) association path mining among concepts in multipart phenotypes that cross disciplines, and 2) fine-granularity knowledge-based content retrieval along the path(s) to permit deeper analysis. Implementing this methodology with our PhenoMining tools has required development of innovative measures of association strength for pairwise associations, as well as the strength for sequences of associations, in addition to powerful lexicon-based association expansion to increase the scope of matching. In our discussions we describe the validation of the methodology using a published heritability study from cognition research, and we obtain comparable results. We show how PhenoMining tools can greatly reduce a domain expert’s time (by several orders of magnitude) when searching and gathering knowledge from the published literature, and can facilitate derivation of interpretable results.
We built these PhenoMining tools on an existing knowledge base (PhenoWiki.org), now called PhenoWiki+, which can greatly speed up the knowledge acquisition process. Further, using the Resource Description Framework (RDF) data model in the PhenoWiki knowledge repository allows us to connect with different knowledge sources to enlarge the knowledge scope. The knowledge base also supports annotation, an important capability for collaborative knowledge discovery.
Chen Liu, Wesley W. Chu, Fred Sabb, D. Stott Parker, Robert Bilder
InfoSearch: A Social Search Engine
Abstract
The staggering growth of online social networking platforms has also propelled information sharing among users in the network. This has helped develop the user-to-content link structure in addition to the already present user-to-user link structure. These two link structures have provided us with a wealth of data that can be exploited to develop a social search engine and significantly improve our search for relevant information. Every user in a social networking platform has their own unique view of the network. Given this, the aim of a social search engine is to analyze the relationships among the friends of an individual user and the information they share, in order to compute the most socially relevant result set for a search query.
In this work, we present InfoSearch: a social search engine. We focus on how we can retrieve and rank information shared by the direct friends of a user in a social search engine. We ask the question: within the boundary of only one hop in a social network topology, how can we rank the results shared by friends? We develop InfoSearch over the Facebook platform to leverage information shared by users in Facebook. We provide a comprehensive study of factors that may have a potential impact on social search engine results. We identify six different ranking factors and invite users to carry out search queries through InfoSearch. The ranking factors are: ‘diversity’, ‘degree’, ‘betweenness centrality’, ‘closeness centrality’, ‘clustering coefficient’ and ‘time’. In addition to the InfoSearch interface, we also conduct user studies to analyze the impact of ranking factors on the social value of result sets.
Prantik Bhattacharyya, Shyhtsun Felix Wu
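Two of the ranking factors named in the abstract above, ‘degree’ and ‘clustering coefficient’, have standard graph-theoretic definitions, and the Python sketch below computes them on a small friendship graph. How InfoSearch actually weights or combines its six factors is not stated on this page, so the sketch only computes the two measures; the graph, the names in it, and the function names are assumptions made for illustration.

```python
from itertools import combinations


def degree(graph, user):
    """Number of direct friends of `user` in an undirected friendship graph."""
    return len(graph[user])


def clustering_coefficient(graph, user):
    """Local clustering coefficient: the fraction of pairs of `user`'s
    friends who are also friends with each other."""
    friends = graph[user]
    k = len(friends)
    if k < 2:
        return 0.0  # no pairs of friends to compare
    linked_pairs = sum(1 for a, b in combinations(friends, 2) if b in graph[a])
    return linked_pairs / (k * (k - 1) / 2)


# Hypothetical undirected friendship graph, stored as adjacency sets.
friendships = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"ann", "cat"},
}

# One possible use (an assumption, not the InfoSearch method): order ann's
# direct friends by a single factor before showing the content they shared.
ranked_friends = sorted(
    friendships["ann"],
    key=lambda f: (-degree(friendships, f), f),
)
```

Here `ranked_friends` orders friends by descending degree with names as a tie-breaker; any real ranking would also need the content-related factors such as ‘time’ and ‘diversity’.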
Social Media in Disaster Relief
Usage Patterns, Data Mining Tools, and Current Research Directions
Abstract
As social media has become more integrated into people’s daily lives, its users have begun turning to it in times of distress. People use Twitter, Facebook, YouTube, and other social media platforms to broadcast their needs, propagate rumors and news, and stay abreast of evolving crisis situations. Disaster relief organizations have begun to craft their efforts around pulling data about where aid is needed from social media and broadcasting their own needs and perceptions of the situation. They have begun deploying new software platforms to better analyze incoming data from social media, as well as new technologies that specifically harvest messages from disaster situations.
Peter M. Landwehr, Kathleen M. Carley
A Generalized Approach for Social Network Integration and Analysis with Privacy Preservation
Abstract
Social network analysis is very useful in discovering the embedded knowledge in social network structures, which is applicable in many practical domains including homeland security, public safety, epidemiology, public health, electronic commerce, marketing, and social science. However, social network data is usually distributed, and no single organization is able to capture the global social network. For example, a law enforcement unit in Region A has the criminal social network data of its region; similarly, another law enforcement unit in Region B has the criminal social network data of Region B. Unfortunately, due to privacy concerns, these law enforcement units may not be allowed to share the data, and therefore neither of them can benefit from analyzing the integrated social network that combines the data from the social networks in Region A and Region B. In this chapter, we discuss aspects of sharing the insensitive and generalized information of social networks to support social network analysis while preserving privacy at the same time. We discuss the generalization approach to construct a generalized social network in which only insensitive and generalized information is shared. We also discuss the integration of the generalized information and how it can satisfy a prescribed level of privacy leakage tolerance, which is measured independently of the privacy-preserving techniques.
Chris Yang, Bhavani Thuraisingham
A Clustering Approach to Constrained Binary Matrix Factorization
Abstract
In general, binary matrix factorization (BMF) refers to the problem of finding two binary matrices of low rank such that the difference between their matrix product and a given binary matrix is minimal. BMF has served as an important tool in dimension reduction for high-dimensional data sets with binary attributes and has been successfully employed in numerous applications. In the existing literature on BMF, the matrix product is not required to be binary. We call this unconstrained BMF (UBMF) and, similarly, call it constrained BMF (CBMF) if the matrix product is required to be binary. In this paper, we first introduce two specific variants of CBMF and discuss their relation to other dimension reduction models such as UBMF. Then we propose alternating update procedures for CBMF. In every iteration of the proposed procedure, we solve a specific binary linear programming (BLP) problem to update the involved matrix argument. We explore the relationship between the BLP subproblem and clustering to develop an effective 2-approximation algorithm for CBMF when the underlying matrix has very low rank. The proposed algorithm can also provide a 2-approximation to rank-1 UBMF. We also develop a randomized algorithm for CBMF and estimate the approximation ratio of the solution obtained. Numerical experiments show that the proposed algorithm for UBMF finds better solutions in less CPU time than several other algorithms in the literature, and the solution obtained from CBMF is very close to that of UBMF.
Peng Jiang, Jiming Peng, Michael Heath, Rui Yang
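To make the distinction between UBMF and CBMF in the abstract above concrete, the Python sketch below evaluates a candidate factorization: it computes the ordinary matrix product of two binary factors, measures how far that product is from the given binary matrix, and checks whether the product is itself binary. The abstract does not say how the "difference" is measured, so the squared entrywise error used here is an assumption, as are the function names and the toy matrices. The sketch only evaluates factorizations; it does not implement the chapter's alternating update or approximation algorithms.

```python
def matmul(U, V):
    """Ordinary (integer) matrix product of two matrices given as nested lists."""
    inner = len(V)
    return [
        [sum(U[i][k] * V[k][j] for k in range(inner)) for j in range(len(V[0]))]
        for i in range(len(U))
    ]


def squared_error(X, U, V):
    """Sum of squared entrywise differences between X and U*V (assumed measure)."""
    P = matmul(U, V)
    return sum(
        (X[i][j] - P[i][j]) ** 2 for i in range(len(X)) for j in range(len(X[0]))
    )


def product_is_binary(U, V):
    """True if every entry of U*V is 0 or 1, i.e. the pair is feasible for CBMF."""
    return all(entry in (0, 1) for row in matmul(U, V) for entry in row)


# Toy 3x3 binary matrix and two candidate rank-2 factorizations (assumed data).
X = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
V = [[1, 1, 0],
     [0, 1, 1]]
U_unconstrained = [[1, 0], [1, 1], [0, 1]]  # product contains a 2
U_constrained = [[1, 0], [1, 0], [0, 1]]    # product stays binary

err_u = squared_error(X, U_unconstrained, V)
err_c = squared_error(X, U_constrained, V)
```

In this toy case both candidates miss `X` in exactly one entry, but only the second keeps the product binary, so only it would be admissible under the constrained formulation.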
Erratum: Data Mining and Knowledge Discovery for Big Data
Abstract
In the original online version of this volume, the foreword is missing.
Wesley W. Chu
Backmatter
Metadata
Title
Data Mining and Knowledge Discovery for Big Data
Edited by
Wesley W. Chu
Copyright year
2014
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-40837-3
Print ISBN
978-3-642-40836-6
DOI
https://doi.org/10.1007/978-3-642-40837-3