
2010 | Book

Web-Age Information Management

11th International Conference, WAIM 2010, Jiuzhaigou, China, July 15-17, 2010. Proceedings

Edited by: Lei Chen, Changjie Tang, Jun Yang, Yunjun Gao

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Analyzing Data Quality Using Data Auditor

Monitoring databases maintain configuration and measurement tables about computer systems, such as networks and computing clusters, and serve important business functions, such as troubleshooting customer problems, analyzing equipment failures, planning system upgrades, etc. These databases are prone to many data quality issues: configuration tables may be incorrect due to data entry errors, while measurement tables may be affected by incorrect, missing, duplicate and delayed polls.

We describe Data Auditor, a system for analyzing data quality and exploring data semantics of monitoring databases. Given a user-supplied constraint, such as a boolean predicate expected to be satisfied by every tuple, a functional dependency, or an inclusion dependency, Data Auditor computes "pattern tableaux", which are concise summaries of subsets of the data that satisfy or fail the constraint. We discuss the architecture of Data Auditor, including the supported types of constraints and the tableau generation mechanism. We also show the utility of our approach on an operational network monitoring database.
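As a rough illustration of the pattern-tableau idea (an illustrative sketch only, not the Data Auditor implementation), the snippet below groups tuples by their values on a set of pattern attributes and reports, per pattern, the fraction of tuples satisfying a user-supplied boolean constraint. The table, attribute names, and predicate are hypothetical.

```python
from collections import defaultdict

def pattern_confidence(tuples, pattern_attrs, predicate):
    """Group tuples by their values on pattern_attrs and report, per pattern,
    the fraction of tuples satisfying the predicate (a rough stand-in for a
    one-level pattern tableau)."""
    stats = defaultdict(lambda: [0, 0])  # pattern -> [satisfied, total]
    for t in tuples:
        pattern = tuple(t[a] for a in pattern_attrs)
        stats[pattern][1] += 1
        if predicate(t):
            stats[pattern][0] += 1
    return {p: sat / total for p, (sat, total) in stats.items()}

# Hypothetical monitoring table: every poll is expected within 300 seconds.
polls = [
    {"router": "r1", "region": "east", "poll_delay": 12},
    {"router": "r2", "region": "east", "poll_delay": 450},
    {"router": "r3", "region": "west", "poll_delay": 30},
]
print(pattern_confidence(polls, ["region"], lambda t: t["poll_delay"] <= 300))
```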

Divesh Srivastava
Rebuilding the World from Views

With the ever-increasing growth of the internet, more and more data sets are being made available. Most of this data has its origin in the real world, often describing the same objects or events from different viewpoints. One can thus consider data sets obtained from different sources as different (and possibly inconsistent) views of our world, and it makes sense to try to integrate them in some form, e.g. to answer questions which involve data from multiple sources. While data integration is an old and well-investigated subject, the nature of the data sets to be integrated is changing. They increase in volume as well as complexity, are often undocumented, relationships between data sets are fuzzier, and representations of the same real-world object differ. To address these challenges, new methods for rapid, semi-automatic, loose and virtual integration, exploration and querying of large families of data sets must be developed.

In an ongoing project we are investigating a framework for sampling and matching data sets in an efficient manner. In particular, we consider the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes [1]. Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. We deal with the issue of different representations of objects, i.e., ‘dirty’ data, by employing new similarity measures between sets of strings, which not only consider set-based similarity, but also similarity between string instances. To make the measures effective, especially in light of data sets being large and distributed, we developed efficient algorithms for distributed sample creation and similarity computation. Central to this is that the sampling is synchronized. For clean data this means that the same values are sampled for each set, if present [2,3]. For dirty data one must ensure that similar values are sampled for each set, if present, and we manage to do so in a probabilistic manner.
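One standard way to obtain synchronized samples for clean data, in the spirit described above (a minimal sketch, not necessarily the scheme of [2,3]), is hash-threshold sampling: a value is kept if and only if its hash falls below a fixed threshold, so the very same values are sampled from every set that contains them, and sample overlap mirrors set overlap.

```python
import hashlib

def synchronized_sample(values, rate=0.1):
    """Keep a value iff its hash falls below `rate`; because the decision
    depends only on the value itself, every data set samples the same values."""
    def h(v):
        return int.from_bytes(hashlib.sha1(v.encode()).digest()[:8], "big") / 2**64
    return {v for v in values if h(v) < rate}

a = {"alice@example.com", "bob@example.com", "carol@example.com"}
b = {"bob@example.com", "dave@example.com"}
sa, sb = synchronized_sample(a, rate=0.5), synchronized_sample(b, rate=0.5)
# Overlap of the samples reflects overlap of the full sets.
print(len(sa & sb), len(sa), len(sb))
```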

The next step of our research is to extend such a sampling and matching approach to multiple attributes and semi-structured data, and to construct search and query systems which make direct use of the matches discovered.

Xiaofang Zhou, Henning Köhler
Approximate Query Processing in Sensor Networks

Many emerging applications are collecting massive volumes of sensor data from networks of distributed devices, such as sensor networks and cyber-physical systems. These environments are commonly characterized by the intrinsic volatility and uncertainty of the sensor data, and the strict communication (energy) constraints of the distributed devices. Approximate query processing is an important methodology that exploits the tolerance of many applications to inaccuracies in reported data in order to reduce communication overhead. The research challenge is how to ensure communication efficiency without sacrificing result usefulness.

Much prior work depends on users to impose preferences or constraints on approximate query processing, such as result inaccuracies, candidate set size, and response time. We argue that pre-determined user preferences may turn out to be inappropriate and become a substantial source of i) jeopardized query results, ii) prohibitive response time, and iii) overwhelming communication overhead.

Our work ‘Probing Queries in Wireless Sensor Networks’ (ICDCS 2008) studies a scenario where empty sets may be returned as accurate query results, yet users may benefit from approximate answer sets that do not exactly conform to the specified query predicates. The approximate answer sets can be used not only to answer the query approximately but also to guide users in modifying their queries to further probe the monitored objects. The distance between sensing data and a query and the dominating relationship between sensing data are first defined. Then, three algorithms for processing probing queries are proposed, which compute the best approximate answer sets consisting of the sensing data with the smallest distance from the given queries. All the algorithms utilize the dominating relationship to reduce the amount of data transmitted in sensor networks by filtering out unnecessary data. Experimental results on real and synthetic data sets show that the proposed algorithms have high performance and energy efficiency.

Our work ‘Enabling ε-Approximate Querying in Sensor Networks’ (VLDB 2009) studies the scenario where, due to the dynamic nature of sensor data, users are unable to determine in advance what error bounds can lead to affordable cost in approximate query processing. We propose a novel ε-approximate querying (EAQ) scheme to resolve the problem. EAQ is a uniform data access scheme underlying various queries in sensor networks. The core idea of EAQ is to introduce run-time iteration and refinement mechanisms to enable efficient, ε-approximate query processing in sensor networks. It thus grants more flexibility to in-network query processing and minimizes energy consumption by communicating data up to a just-sufficient level. To ensure bounded overall cost for the iteration and refinement procedures of the EAQ scheme, we develop a novel data shuffling algorithm. The algorithm converts sensed datasets into special representations called MVA. From prefixes of an MVA, we can recover approximate versions of the entire dataset, where all individual data items have guaranteed error bounds. The EAQ scheme supports efficient and flexible processing of various queries including spatial window queries, value range queries, and queries with QoS constraints. The effectiveness and efficiency of the EAQ scheme are evaluated in a real sensor network testbed.

Even when users know exactly what result inaccuracies they can tolerate, many prior query processing techniques still cannot meet arbitrary user-given precision requirements. Most notably, many aggregation query processing methods can only support fixed error bounds.

In ‘Sampling based (ε, δ)-Approximate Aggregation Algorithm in Sensor Networks’ (ICDCS 2009), we propose a uniform sampling based aggregation algorithm. We prove that for any ε (ε > 0) and δ (0 ≤ δ ≤ 1), this algorithm returns an approximate aggregation result such that the probability of the relative error of the result being larger than ε is less than δ. However, this algorithm is only suitable for static networks. Considering the dynamic nature of sensor networks, we further propose a Bernoulli sampling based algorithm in ‘Bernoulli Sampling based (ε, δ)-Approximate Aggregation in Large-Scale Sensor Networks’ (INFOCOM 2010). We prove that this algorithm can also meet any precision requirement and is suitable for both static and dynamic networks. Besides, two sample-adaptive algorithms are also provided: one adapts the sample as the precision requirement varies, and the other adapts the sample as the sensed data in the network vary. Theoretical analysis and experiments show that all the proposed algorithms have high performance in terms of accuracy and energy cost.
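To give a feel for how sampling yields (ε, δ)-style guarantees, the textbook sketch below uses Hoeffding's inequality to pick a uniform sample size for estimating a mean. It is only an illustration under the assumption that readings are normalized to [0, 1], and it bounds absolute rather than relative error; it is not the paper's in-network algorithm.

```python
import math, random

def hoeffding_sample_size(eps, delta):
    """Smallest n with 2*exp(-2*n*eps^2) <= delta, i.e. the sampled mean of
    values in [0, 1] deviates from the true mean by more than eps with
    probability at most delta (Hoeffding's inequality)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def approx_mean(readings, eps=0.05, delta=0.01):
    n = min(len(readings), hoeffding_sample_size(eps, delta))
    return sum(random.sample(readings, n)) / n

readings = [random.random() for _ in range(100_000)]  # normalized sensor values
print(hoeffding_sample_size(0.05, 0.01), approx_mean(readings))
```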

Jianzhong Li

Web Data I

Duplicate Identification in Deep Web Data Integration

Duplicate identification is a critical step in deep web data integration, and generally, this task has to be performed over multiple web databases. However, a matcher customized for two web databases often does not work well for another two, due to varying presentations and different schemas. It is not practical to build and maintain $C^{2}_{n}$ matchers for n web databases. In this paper, we aim at building one universal matcher over multiple web databases in one domain. According to our observation, the similarity on an attribute is dependent on those of some other attributes, which is ignored by existing approaches. Inspired by this, we propose a comprehensive solution for the duplicate identification problem over multiple web databases. Extensive experiments over real web databases in three domains show the proposed solution is an effective way to address the duplicate identification problem over multiple web databases.

Wei Liu, Xiaofeng Meng, Jianwu Yang, Jianguo Xiao
Learning to Detect Web Spam by Genetic Programming

Web spam techniques enable some web pages or sites to achieve undeserved relevance and importance. They can seriously deteriorate search engine ranking results. Combating web spam has become one of the top challenges for web search. This paper proposes to learn a discriminating function to detect web spam by genetic programming. The evolutionary computation uses multiple populations composed of small-scale individuals and combines the best individuals selected from every population to obtain the best possible discriminating function. The experiments on WEBSPAM-UK2006 show that the approach can improve spam classification recall by 26%, F-measure by 11%, and accuracy by 4% compared with SVM.

Xiaofei Niu, Jun Ma, Qiang He, Shuaiqiang Wang, Dongmei Zhang
Semantic Annotation of Web Objects Using Constrained Conditional Random Fields

Semantic annotation of Web objects is a key problem for Web information extraction. The Web contains an abundance of useful semi-structured information about real world objects, and empirical study shows that strong sequence characteristics exist for Web information about objects of the same type across different Web sites. Conditional Random Fields (CRFs) are state-of-the-art approaches that exploit these sequence characteristics for better labeling. However, previous CRFs have limitations and cannot efficiently deal with a variety of logical constraints between Web object elements. This paper presents a Constrained Conditional Random Fields (Constrained CRFs) model for semantic annotation of Web objects. The model incorporates a novel inference procedure based on integer linear programming and extends CRFs to naturally and efficiently support all kinds of logical constraints. Experimental results using a large amount of real-world data collected from diverse domains show that the proposed approach can significantly improve the semantic annotation accuracy of Web objects.

Yongquan Dong, Qingzhong Li, Yongqing Zheng, Xiaoyang Xu, Yongxin Zhang
Time Graph Pattern Mining for Web Analysis and Information Retrieval

Graph patterns on the web are characteristic substructures of web graphs. Graph patterns have played important roles in web analysis and information retrieval. However, temporal characteristics of the web have not been sufficiently considered in previously proposed graph patterns. We first propose time graph patterns, and then mine the patterns representing topics that are discussed extensively on the web. Conducting experiments using those mined patterns, we ascertain that time graph patterns open a novel way for information retrieval and web analysis reflecting temporal characteristics.

Taihei Oshino, Yasuhito Asano, Masatoshi Yoshikawa

Networked Data

FISH: A Novel Peer-to-Peer Overlay Network Based on Hyper-deBruijn

Autonomy, efficiency, robustness and load balancing are four desirable features for Peer-to-Peer (P2P) systems. These four features, however, are often in conflict with each other. We present a novel P2P architecture, called FISH, based on the Hyper-deBruijn topology. FISH provides flexibility in terms of connections per node and the level of fault-tolerance, and possesses a low diameter. We further address the challenge of dynamic peer operations by introducing a novel set of algorithms. We also design two variants of the Hyper-deBruijn topology to achieve an asymptotically optimal diameter. Comprehensive experiments show that FISH achieves a good trade-off among the four expected features.

Ye Yuan, Guoren Wang, Yongjiao Sun
Continuous Summarization of Co-evolving Data in Large Water Distribution Network

While traditional water data stream analysis focuses mainly on a single sensor node or monitoring station, having an accurate picture of the overall data patterns is more meaningful for understanding a large water distribution network's behavior and characteristics, tracking important trends, and making informed judgments about measurement or utilization operations. In this paper, we propose a continuous summarization scheme that aims to continuously provide Representative Patterns of the complete data in a large water distribution network. Our core contributions are to propose Representative Patterns for describing the spatio-temporal patterns in a water distribution network and to employ a parameter-free algorithm based on the Minimum Description Length (MDL) Principle to automatically split data streams into episodes for generating the series of representative patterns. Moreover, we evaluate our approaches on a real water distribution network from the Battle of the Water Sensor Networks (BWSN). Experimental results show that our online summarization methods are effective, scalable and interpretable. Moreover, we discover interesting periodic time-evolving patterns in the chlorine data.

Hongmei Xiao, Xiuli Ma, Shiwei Tang, Chunhua Tian
Proactive Replication and Search for Rare Objects in Unstructured Peer-to-Peer Networks

The search efficiency problem in unstructured peer-to-peer networks has not been adequately addressed so far, especially concerning search for rare objects. In this paper, we propose a proactive replication strategy to improve the search efficiency for rare objects. It uses an object probing technique for peers to decide whether or not to establish replicas of their objects when they join the network. This strategy can effectively increase the popularity of rare objects so as to enhance the search efficiency. We also present a rare object search algorithm. When a peer forwards a search request, the forward probability is calculated according to its neighbors' degrees and the number of their objects. Therefore, the search request is forwarded to the peers most likely to contain target objects. Simulations show that the proactive replication strategy greatly improves the search efficiency for rare objects with moderate communication overhead. The rare object search algorithm not only improves search efficiency for rare objects, but also achieves load balance in search.

Guoqiang Gao, Ruixuan Li, Kunmei Wen, Xiwu Gu, Zhengding Lu
SWORDS: Improving Sensor Networks Immunity under Worm Attacks

Sensor nodes are implemented as embedded computer systems that lack complicated hardware architectures and operating systems. Therefore, worm attacks, which exploit buffer-overflow vulnerabilities, could compromise an entire sensor network by sending a single mal-packet. To address this severe problem, the technique of software diversity has been leveraged to defend against sensor worms. In this paper, we combine random sensor node scheduling and software diversity to propose a novel sensor worm defense scheme. Our scheme can effectively improve the sensor network's defensive capability. Some practical considerations for its real application are then discussed. Finally, analytical and simulation results confirm the effectiveness of our scheme in sensor worm defense.

Nike Gui, Ennan Zhai, Jianbin Hu, Zhong Chen
Efficient Multiple Objects-Oriented Event Detection Over RFID Data Streams

Complex event processing has been extensively applied in areas such as RFID tracking for supply chain management, fluctuation detection in stock trading, real-time intrusion detection in network monitoring, etc. Most existing research focuses on the specification, formalization and evaluation of single-object-oriented complex event processing. In this paper, we investigate complex event processing problems over multiple correlated RFID objects and study multiple correlated RFID event detection problems. We present two kinds of evaluation algorithms: the SEquence Join Algorithm (SEJA) and the Stream Join Algorithm (SJA). Experimental studies demonstrate that our proposed algorithms are effective and scalable.

Shanglian Peng, Zhanhuai Li, Qiang Li, Qun Chen, Hailong Liu, Yanming Nie, Wei Pan

Social Networks

CW2I: Community Data Indexing for Complex Query Processing

The increasing popularity of Community Web Management Systems (CWMSs) calls for tailor-made data management approaches. Still, existing CWMSs have mostly focused on simple similarity-based queries; they do not provide a framework for the efficient processing of more complex queries over community web data. In this paper, we propose a two-way indexing scheme that facilitates efficient and scalable retrieval and complex query processing with community data. A thorough experimental comparison, based on real-world data and practical queries, illustrates the advantages of our scheme compared to other approaches for community web data management. Moreover, our double-indexing scheme also provides an attractive solution to the storage problem.

Mei Hui, Panagiotis Karras, Beng Chin Ooi
Clustering Coefficient Queries on Massive Dynamic Social Networks

The Clustering Coefficient (CC) is a fundamental measure in social network analysis assessing the degree to which nodes tend to cluster together. While CC computation on static graphs is well studied, emerging applications have new requirements for online query of the “global” CC of a given subset of a graph. As social networks are widely stored in databases for easy updating and accessing, computing CC of their subset becomes a time-consuming task, especially when the network grows large and cannot fit in memory. This paper presents a novel method called “Approximate Neighborhood Index (ANI)” to significantly reduce the query latency for CC computation compared to traditional SQL based database queries. A Bloom-filter-like data structure is leveraged to construct ANI in front of a relational database. Experimental results show that the proposed approach can guarantee the correctness of a CC query while significantly reducing the query latency at a reasonable memory cost.
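The sketch below illustrates the general flavour of answering edge-membership tests against a compact in-memory structure while computing a clustering coefficient, instead of issuing per-edge SQL queries. The Bloom filter here is a simple stand-in for illustration, not the paper's ANI (which additionally guarantees correct answers); all parameters are assumptions.

```python
import hashlib
from itertools import combinations

class BloomFilter:
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8 + 1)
    def _positions(self, item):
        d = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.m for i in range(self.k)]
    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def edge_key(u, v):
    return f"{min(u, v)}-{max(u, v)}"

def clustering_coefficient(nodes, adj, edge_filter):
    """Approximate global CC of the induced subgraph: edge tests go to the
    in-memory Bloom filter instead of the backing database (false positives
    may overcount slightly)."""
    closed = total = 0
    for v in nodes:
        neighbors = [u for u in adj.get(v, []) if u in nodes]
        for a, b in combinations(neighbors, 2):
            total += 1
            closed += edge_key(a, b) in edge_filter
    return closed / total if total else 0.0

adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
bf = BloomFilter()
for u, nbrs in adj.items():
    for v in nbrs:
        bf.add(edge_key(u, v))
print(clustering_coefficient({"a", "b", "c"}, adj, bf))  # 1.0 for a triangle
```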

Zhiyu Liu, Chen Wang, Qiong Zou, Huayong Wang
Predicting Best Answerers for New Questions in Community Question Answering

Community question answering (CQA) has become a very popular web service providing a platform for people to share knowledge. In current CQA services, askers post their questions to the system and wait passively for answerers to answer them. This procedure has several drawbacks. Since new questions are presented to all users in the system, askers cannot expect experts to answer their questions. Meanwhile, answerers have to visit many questions and then pick out only a small part of them to answer. To overcome these drawbacks, a probabilistic framework is proposed to predict the best answerers for new questions. By tracking answerers' answering history, the interests of answerers are modeled with a mixture of the Language Model and the Latent Dirichlet Allocation model. User activity and authority information is also taken into consideration. Experimental results show the proposed method can effectively push new questions to the best answerers.

Mingrong Liu, Yicen Liu, Qing Yang
Semantic Grounding of Hybridization for Tag Recommendation

Tag recommendation for new resources is one of the most important issues discussed recently. Many existing approaches ignore text semantics and cannot recommend new tags which are not in the training dataset (e.g., FolkRank). Some exceptional semantic approaches use a probabilistic latent semantic method to recommend tags in terms of topic knowledge (e.g., the ACT model). However, they do not perform well because the many entities in these models introduce much noise. In this paper, we propose hybrid approaches in folksonomy to tackle these problems. The hybrid approaches combine the Language Model (LM) as a keyword-based approach with Latent Dirichlet Allocation (LDA), the Tag-Topic (TT) model and the User-Tag-Topic (UTT) model as topic-based approaches. Our approaches can recommend meaningful tags and can be used to discover implicit resource correlations. Experimental results on the Bibsonomy dataset show that LM performs better than all other hybrid and non-hybrid approaches. Also, the hybrid approaches with fewer entities (e.g., LDA with only one entity) perform better than those with more entities (e.g., UTT with three entities) for the tag recommendation task.

Yan’an Jin, Ruixuan Li, Yi Cai, Qing Li, Ali Daud, Yuhua Li
Rich Ontology Extraction and Wikipedia Expansion Using Language Resources

Existing social collaboration projects contain a host of conceptual knowledge, but are often only sparsely structured and hardly machine-accessible. Using the well known Wikipedia as a showcase, we propose new and improved techniques for extracting ontology data from the wiki category structure. Applications like information extraction, data classification, or consistency checking require ontologies of very high quality and with a high number of relationships. We improve upon existing approaches by finding a host of additional relevant relationships between ontology classes, leveraging multi-lingual relations between categories and semantic relations between terms.

Christian Schönberg, Helmuth Pree, Burkhard Freitag

Cloud Computing

Fine-Grained Cloud DB Damage Examination Based on Bloom Filters

More and more cloud database services have emerged recently that are economical, convenient and highly scalable. However, the security concerns in these services are more prominent than ever before. Users should not rely on the Cloud Service Providers (CSPs) alone to protect the integrity and retrievability of their cloud databases. We argue that it is not enough for cloud DB owners to know of the existence of damage; more importantly, they should be able to locate the damage precisely for subsequent recovery. In this paper we provide fine-grained cloud DB damage examination methods based on checking the results of sampling queries. The examination process supports element/tuple-level damage identification and is cost-efficient thanks to Bloom filters. The process is also server-transparent and imperceptible, which prevents the CSP from being inconsistent to different users.

Min Zhang, Ke Cai, Dengguo Feng
XML Structural Similarity Search Using MapReduce

XML is a de-facto standard for web data exchange and information representation. Efficient management of these large volumes of XML data brings challenges to conventional techniques. To cope with large-scale data, the MapReduce computing framework has recently attracted more and more attention in the database community as an efficient solution. In this paper, an efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce. First, sub-structures are extracted in parallel from a large XML corpus located on the cluster. Then Min-Hashing and locality-sensitive hashing techniques are developed on the distributed and parallel computing framework for efficient structural similarity search processing. An empirical study on the cluster with real, large datasets demonstrates the effectiveness and efficiency of our approach.
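A compact, single-machine sketch of the MinHash/LSH idea over sets of extracted sub-structures follows (illustrative only; the paper's pipeline distributes these steps with MapReduce, and all parameters and inputs here are assumptions): sets with high Jaccard similarity agree on signature bands with high probability and thus collide in the same bucket.

```python
import hashlib
from collections import defaultdict

def minhash_signature(items, num_hashes=20):
    """One minimum per seeded hash function over the item set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items))
    return sig

def lsh_buckets(signatures, bands=5, rows=4):
    """Hash each band of the signature; similar sets share at least one bucket."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

# Hypothetical sets of XML sub-structures (e.g. root-to-leaf paths).
docs = {
    "x1": {"/a/b", "/a/c", "/a/b/d"},
    "x2": {"/a/b", "/a/c", "/a/b/e"},
    "x3": {"/p/q", "/p/r"},
}
sigs = {d: minhash_signature(s) for d, s in docs.items()}
print(lsh_buckets(sigs))  # x1 and x2 are likely to land in a common bucket
```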

Peisen Yuan, Chaofeng Sha, Xiaoling Wang, Bin Yang, Aoying Zhou, Su Yang
Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications

Hadoop has been widely used in various clusters to build scalable and high performance distributed file systems. However, the Hadoop distributed file system (HDFS) is designed for large-file management. In small-file applications, metadata requests flood the network and consume most of the memory in the Namenode, which sharply hinders performance. Therefore, many web applications do not benefit from clusters with a centralized metadata node, like Hadoop. In this paper, we compare our Fat-Btree based data access method, which requires no central node in the cluster, with Hadoop, and show how their performance differs across different file I/O applications.

Min Luo, Haruo Yokota

Data Mining I

Mining Contrast Inequalities in Numeric Dataset

Finding relational expressions which exist frequently in one class of data but not in the other is an interesting task. In this paper, a relational expression of this kind is defined as a contrast inequality. Gene Expression Programming (GEP) is powerful for discovering relations from data and expressing them mathematically. Hence, it is desirable to apply GEP to such a mining task. The main contributions of this paper include: (1) introducing the concept of contrast inequality mining, (2) designing a two-genome chromosome structure to guarantee that each individual in GEP is a valid inequality, (3) proposing a new genetic mutation to improve the efficiency of evolving contrast inequalities, (4) presenting a GEP-based method to discover contrast inequalities, and (5) giving an extensive performance study on real-world datasets. The experimental results show that the proposed methods are effective. Contrast inequalities with high discriminative power are discovered from the real-world datasets. Some potential future work on contrast inequality mining is discussed.

Lei Duan, Jie Zuo, Tianqing Zhang, Jing Peng, Jie Gong
Users’ Book-Loan Behaviors Analysis and Knowledge Dependency Mining

Book-loan is the most important library service. Studying users' book-loan behavior patterns can help libraries provide more proactive services. Based on users' book-loan history in a university library, we build a book-borrowing network between users and books. Furthermore, users who borrow the same books are linked together; the users and links then form a co-borrowing network which can be regarded as a knowledge sharing network. Both the book-borrowing network and the co-borrowing network can be used to study users' book-loan behavior patterns. This paper presents a study in analyzing users' book-loan behaviors and mining knowledge dependency between schools and degrees at Peking University. The mining work is based on the book-borrowing network and its corresponding co-borrowing network. To the best of our knowledge, this is the first work to mine knowledge dependency in the digital library domain.

Fei Yan, Ming Zhang, Jian Tang, Tao Sun, Zhihong Deng, Long Xiao
An Extended Predictive Model Markup Language for Data Mining

Common data mining metadata benefits sharing, exchange and integration among data mining applications. The Predictive Model Markup Language (PMML) facilitates the exchange of models among data mining applications and has become a standard for data mining metadata. However, with the evolution of models and the extension of products, PMML requires a large number of language elements, which inevitably leads to conflicts in PMML-based data mining metadata. This paper presents an extended predictive model markup language, EPMML, for data mining, which is designed to reduce the complexity of PMML language elements. The description logic for the predictive model markup language, DL4PMML, which belongs to the description logic family, is the formal logical foundation of EPMML and gives it strong semantic expressiveness. We analyze in detail how EPMML describes data mining contents. Experiments illustrate how EPMML-based data mining metadata supports automatic reasoning and detects inherent semantic conflicts.

Xiaodong Zhu, Jianzheng Yang
A Cross-Media Method of Stakeholder Extraction for News Contents Analysis

We are studying content analysis of multimedia news as a solution to the issue of bias, which multimedia news, as a reflection of the real world, also faces. For the content analysis, we use a stakeholder model representing descriptions of different stakeholders, which are defined as the main participants in an event. In this paper, we propose a method of detecting stakeholders as the core component of the stakeholder-oriented analysis. In our work, a stakeholder is assumed to appear in the video clips and be mentioned in the closed captions frequently. Given a series of video clips and their closed captions reporting the same event, we extract stakeholder candidates from both textual and visual descriptions. After that, we calculate the degree of exposure for each candidate to identify stakeholders. We also present experimental results that validate our method.

Ling Xu, Qiang Ma, Masatoshi Yoshikawa

Stream Processing

An Efficient Approach for Mining Segment-Wise Intervention Rules in Time-Series Streams

Huge volumes of time-series stream data are collected every day in many areas, and their trends may be impacted by outside events and hence biased from their normal behavior. This phenomenon is referred to as intervention. Intervention rule mining is a new research direction in data mining with great challenges. To address these challenges, this study makes the following contributions: (a) it proposes a framework to detect intervention events in time-series streams, (b) it proposes approaches to evaluate the impact of intervention events, and (c) it conducts extensive experiments on both real and synthetic data. The experimental results show that the newly proposed methods reveal interesting knowledge and perform well with good accuracy and efficiency.

Yue Wang, Jie Zuo, Ning Yang, Lei Duan, Hong-Jun Li, Jun Zhu
Automated Recognition of Sequential Patterns in Captured Motion Streams

Motion capture data has been frequently used in computer animation and video games. Motions are often captured in a continuous manner, such that a motion contains multiple patterns joined sequentially without obvious breakpoints between them. Learning the captured motions is challenging, as it requires both segmentation and recognition. In this paper, a new method based on an extension of open-end dynamic time warping (OE-DTW) is proposed to automatically segment and recognize sequential patterns in motion streams. To enhance the performance, we introduce a global constraint of K-Repetition on OE-DTW and a flexible end point detection scheme. In the experiments, we applied our method to different classes of dance motions and demonstrated its effectiveness by comparison with an existing approach.
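For orientation, a minimal dynamic-programming sketch of plain open-end DTW on 1-D sequences is given below (an illustration only, without the paper's K-Repetition constraint or end-point detection scheme, and with made-up data): the whole pattern must be matched, but the stream prefix may end anywhere, so the distance is the minimum over the last row.

```python
import math

def open_end_dtw(pattern, stream):
    """Align the whole `pattern` against any prefix of `stream`; return the
    best prefix end index and the corresponding DTW cost."""
    m, n = len(pattern), len(stream)
    INF = math.inf
    d = [[INF] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(pattern[i - 1] - stream[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    end, best = min(enumerate(d[m][1:], start=1), key=lambda kv: kv[1])
    return end, best

pattern = [0.0, 1.0, 0.0]
stream = [0.1, 0.9, 0.2, 5.0, 5.0]       # pattern occurs in the first three frames
print(open_end_dtw(pattern, stream))      # ends near index 3 with a small cost
```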

Liqun Deng, Howard Leung, Naijie Gu, Yang Yang
Online Pattern Aggregation over RFID Data Streams

Complex events are very useful in radio frequency identification (RFID) applications. Even though many techniques have been proposed to detect RFID complex event instances online, they do not support online pattern aggregation operations. In this paper, we present an online pattern aggregation algorithm, named RFID_PAQ. RFID_PAQ divides a sliding window into several sub-windows and computes its aggregation value from those of its sub-windows. Based on the distribution characteristics of RFID data streams, we design different mapping functions to dispatch events to the related sub-windows. Finally, our extensive experiments demonstrate that RFID_PAQ is effective and efficient.
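The sub-window idea can be illustrated with a simple paned count aggregation (a generic sketch, not RFID_PAQ itself; pane sizes and events are hypothetical): each sub-window keeps a partial aggregate, and the sliding-window answer is combined from the partials instead of re-scanning every event.

```python
from collections import deque

class PanedWindowCount:
    """Sliding-window count built from fixed-size sub-window (pane) partials."""
    def __init__(self, window_panes, pane_size):
        self.window_panes, self.pane_size = window_panes, pane_size
        self.panes = deque(maxlen=window_panes)   # partial counts per closed pane
        self.current, self.in_current = 0, 0

    def add(self, _event):
        self.current += 1
        self.in_current += 1
        if self.in_current == self.pane_size:     # close the pane
            self.panes.append(self.current)
            self.current, self.in_current = 0, 0

    def window_count(self):
        return sum(self.panes) + self.current     # combine the partials

w = PanedWindowCount(window_panes=3, pane_size=4)
for tag_read in range(10):
    w.add(tag_read)
print(w.window_count())  # events in the last few panes plus the open pane
```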

Hailong Liu, Zhanhuai Li, Qun Chen, Shanglian Peng
Cleaning Uncertain Streams by Parallelized Probabilistic Graphical Models

Real-world applications generate uncertain streams due to unreliable equipment and/or data processing such as object identification. However, the application context implies specific rules, which are critical in cleaning data and making them closer to reality. In this paper, we propose a framework for cleaning uncertain streams by Parallelized Probabilistic Graphical Models (P2GM). Making full use of multi-core processing architectures, the system processes parallelized high-volume streams efficiently. With P2GM, users can define their own cleaning algorithms and generate specific parallelized systems. We implement a prototype of video surveillance based on P2GM, and demonstrate the quality and performance of our approaches experimentally.

Qian Zhang, Shan Wang, Biao Qin

Graph Processing

Taming Computational Complexity: Efficient and Parallel SimRank Optimizations on Undirected Graphs

SimRank has been considered one of the most promising link-based ranking algorithms for evaluating similarities of web documents in many modern search engines. In this paper, we investigate the optimization problem of SimRank similarity computation on undirected web graphs. We first present a novel algorithm to estimate the SimRank between vertices in $O\left( {{n}^{3}}+K\cdot {{n}^{2}} \right)$ time, where n is the number of vertices and K is the number of iterations. In comparison, the most efficient implementation of the SimRank algorithm in [1] takes $O\left( K\cdot {{n}^{3}} \right)$ time in the worst case. To efficiently handle large-scale computations, we also propose a parallel implementation of the SimRank algorithm on multiple processors. Experimental evaluations on both synthetic and real-life data sets demonstrate the better computational time and parallel efficiency of our proposed techniques.
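For reference, the basic SimRank fixed-point iteration on an undirected graph looks roughly like the sketch below (a naive illustration of the recurrence, not the paper's optimized or parallel algorithm; the graph is made up).

```python
def simrank(adj, C=0.8, iterations=10):
    """Naive SimRank fixed-point iteration on an undirected graph given as an
    adjacency dict; costs O(K * n^2 * d^2) for average degree d, the kind of
    baseline the optimized algorithms improve on."""
    nodes = list(adj)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b or not adj[a] or not adj[b]:
                    continue
                total = sum(sim[u][v] for u in adj[a] for v in adj[b])
                new[a][b] = C * total / (len(adj[a]) * len(adj[b]))
        sim = new
    return sim

g = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
print(round(simrank(g)["a"]["b"], 3))  # a and b share neighbor c, so similarity > 0
```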

Weiren Yu, Xuemin Lin, Jiajin Le
DSI: A Method for Indexing Large Graphs Using Distance Set

Recent years have witnessed a great increase in modeling data as large graphs in multiple domains, such as XML, the semantic web, and social networks. In these circumstances, researchers are interested in querying the large graph as follows: given a large graph G and a query Q, report all the matches of Q in G. Since subgraph isomorphism checking is proven to be NP-complete [1], it is infeasible to scan the whole large graph for answers, especially when the query's size is also large. Hence, the "filter-verification" approach is widely adopted. In this approach, researchers first index the neighborhood of each vertex in the large graph, then filter vertices, and finally perform subgraph matching algorithms. Previous techniques mainly focus on efficient matching algorithms, paying little attention to indexing techniques. However, appropriate indexing techniques could help improve the efficiency of query response by generating fewer candidates. In this paper we investigate indexing techniques on large graphs, and propose an index structure, DSI (Distance Set Index), to capture the neighborhood of each vertex. Through our distance set index, more vertices can be pruned, resulting in a much smaller search space; a subgraph matching algorithm is then performed in the search space. We have applied our index structure to real and synthetic datasets. Extensive experiments demonstrate the efficiency and effectiveness of our indexing technique.

Yubo Kou, Yukun Li, Xiaofeng Meng
K-Radius Subgraph Comparison for RDF Data Cleansing

With the rapid development of semantic web technology, RDF data explosion has become a challenging problem. Since RDF data often come from different sources that may overlap with each other, they can contain duplicates. These duplicates may cause ambiguity and even errors in reasoning. However, little attention has been paid to this problem. In this paper, we study the problem and give a solution, named K-radius subgraph comparison (KSC). The proposed method is based on an RDF hierarchical graph model. KSC combines similarity and comparison of context to detect duplicates in RDF data. Experiments on publication datasets show that the proposed method is efficient in duplicate detection for RDF data. KSC is simpler and less time-consuming than other graph comparison methods.

Hai Jin, Li Huang, Pingpeng Yuan

Query Processing

A Novel Framework for Processing Continuous Queries on Moving Objects

Traditional techniques for processing continuous queries on moving objects reduce query re-computation through single-threaded and shared execution among multiple queries, and do not make use of the parallel computing capabilities of the now ubiquitous multi-core CPUs. To exploit this kind of parallelism, a Multi-threading based Framework for Continuous Queries (MFCQ) is proposed, which adopts a strategy of re-computing all of the queries periodically. The framework divides query processing into three phases, the updating, optimization and execution stages, and multi-threading based methods are used in each phase. Moreover, the framework is general, because it is compatible with various index techniques and query algorithms. Using the framework, a query-index based KNN algorithm and an object-index based KNN algorithm are proposed. Experimental results show that the multi-threading framework executed on a multi-core platform outperforms the traditional YPK-CNN algorithm.

Liang Zhao, Ning Jing, Luo Chen, Zhinong Zhong
Group Visible Nearest Neighbor Queries in Spatial Databases

Traditional nearest neighbor queries and their variants, such as the Group Nearest Neighbor Query (GNN), have been widely studied by many researchers. Recently, obstacles have been considered in spatial queries; their existence may affect the query results due to the visibility of the query point. In this paper, we propose a new type of query, the Group Visible Nearest Neighbor Query (GVNN), which considers both visibility and distance as constraints. The Multiple Traversing Obstacles (MTO) algorithm and the Traversing Obstacles Once (TOO) algorithm are proposed to efficiently solve the GVNN problem. TOO resolves GVNN by defining the invisible region of the MBR of the query set to prune both the data set and the obstacle set, and traverses the obstacle R*-tree only once. Experiments with different settings show that TOO is more efficient and scalable than MTO.

Hu Xu, Zhicheng Li, Yansheng Lu, Ke Deng, Xiaofang Zhou
iPoc: A Polar Coordinate Based Indexing Method for Nearest Neighbor Search in High Dimensional Space

K-nearest neighbor (KNN) search in high dimensional space is essential for database applications, especially multimedia database applications, because images and audio clips are usually modeled as high dimensional vectors. However, the performance of existing indexing methods degrades dramatically as the dimensionality increases. In this paper, we propose a novel polar coordinate based indexing method, called iPoc, for efficient KNN search in high dimensional space. First, the data space is partitioned into hypersphere regions, and then each hypersphere is further refined into hypersectors via hyperspherical surface clustering. After that, a series of local polar coordinate systems can be derived from the hypersectors, taking advantage of their geometric characteristics. During search processing, iPoc can effectively prune query-unrelated data points by estimating lower and upper bounds in both the radial and the angular coordinate. Furthermore, we design a key mapping scheme to merge keys measured by independent local polar coordinates into global polar coordinates. Finally, the global polar coordinates are indexed by a traditional 2-dimensional spatial index, e.g., an R-tree. Extensive experiments on real audio datasets and synthetic datasets confirm the effectiveness and efficiency of our proposal and show that iPoc is more efficient than existing high dimensional KNN search methods.

Zhang Liu, Chaokun Wang, Peng Zou, Wei Zheng, Jianmin Wang
Join Directly on Heavy-Weight Compressed Data in Column-Oriented Database

Operating directly on compressed data can decrease CPU costs. Many light-weight compression schemes, such as run-length encoding and bit-vector encoding, can gain this benefit easily, but heavy-weight Lempel-Ziv (LZ) compression has had no method to operate directly on compressed data. We propose a join algorithm, LZ join, which joins two relations R and S directly on compressed data while decoding. Regarding R as the probe table and S as the build table, R is encoded by LZ. When R probes S, LZ join decreases the join cost by using cached results (previous join results for the IDs in R's LZ dictionary window, reused when the decoder finds the same ID sequence of R in the window). LZ join combines the decoding and join phases into one, which reduces the memory usage for decoding the whole of R and the CPU overhead for probing those cached results. Our analysis and experiments show that LZ join is better in some cases: the higher the compression ratio, the better.
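A toy illustration of the cached-result idea follows. It assumes an LZ77-like token stream for R (literal join-key IDs plus (offset, length) back-references) and an equi-join against a pre-built hash table on S; this is a hypothetical sketch of the general mechanism, not the paper's exact algorithm. When the decoder meets a back-reference, it reuses the join results cached for the referenced positions instead of probing S again.

```python
def lz_join(compressed_r, s_index):
    """Join R (LZ77-style tokens: either a literal key or an ('ref', offset,
    length) back-reference) with S while decoding. Join results are cached per
    decoded position, so a back-reference copies cached results instead of
    re-probing the hash table built on S."""
    decoded, cached, output = [], [], []
    for token in compressed_r:
        if isinstance(token, tuple) and token[0] == "ref":
            _, offset, length = token
            start = len(decoded) - offset
            for i in range(start, start + length):
                decoded.append(decoded[i])
                cached.append(cached[i])          # reuse, no probe of S
                output.extend(cached[i])
        else:                                     # literal key: probe S once
            matches = [(token, payload) for payload in s_index.get(token, [])]
            decoded.append(token)
            cached.append(matches)
            output.extend(matches)
    return output

s_index = {1: ["s-row-a"], 2: ["s-row-b"]}        # hash table built on S
r_tokens = [1, 2, ("ref", 2, 2), 3]               # decodes to [1, 2, 1, 2, 3]
print(lz_join(r_tokens, s_index))
```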

Gan Liang, Li RunHeng, Jia Yan, Jin Xin

Potpourri

Exploiting Service Context for Web Service Search Engine

Service-oriented architecture (SOA) is rapidly becoming one of the most significant computing paradigms. However, as the number of services increases, haphazard service definitions make service discovery tedious and less efficient. In this paper, we propose a novel context model, "SPOT", to express service usage information. Based on the SPOT definition, we build a service collaboration graph and propose to analyze the collaboration structure to rank services by their usage goodness. The distinctive feature of our method lies in introducing a service context model, a new model for dealing with service context information, and integrating it to support service search. Our experimental results indicate that our context-based ranking is useful for recommending good services, and that service context makes up for service description heterogeneity and can help distinguish content-"similar" services.

Rong Zhang, Koji Zettsu, Yutaka Kidawara, Yasushi Kiyoki
Building Business Intelligence Applications Having Prescriptive and Predictive Capabilities

Traditional Business Intelligence applications have focused on providing a one-stop shop to integrate enterprise information. The resulting applications are only capable of providing descriptive information, viz. standard and ad-hoc reporting and drill-down capability. As part of an effort to provide prescriptive and predictive capability, we demonstrate a new architecture, methodology and implementation. Based on the Cognos Business Intelligence platform and the ILOG optimization engine, we showcase a truly predictive application that enables optimal decision making in real-time analytical scenarios.

Chen Jiang, David L. Jensen, Heng Cao, Tarun Kumar
FileSearchCube: A File Grouping Tool Combining Multiple Types of Interfile-Relationships

Files in computers are increasing in number, so we require file management tools to find target files and to classify large groups of files. Our research group has been developing a system that provides virtual directories made up of related files. Many methods to extract inter-file relationships are available, such as word frequency, access co-occurrence, and so on. In practice, users need to select and combine multiple types of inter-file relationships to form groups of files they want. For this purpose, we propose FileSearchCube, a tool for file group analysis that introduces the data-cube concept. FileSearchCube provides cube operations for a set of files, and helps users to find relevant file groups.

Yousuke Watanabe, Kenichi Otagiri, Haruo Yokota
Trustworthy Information: Concepts and Mechanisms

We are used to treating information received (from recognized sources) as trustworthy, which is unfortunately not true because of attacks. The situation can get worse with the emerging shift of the information sharing paradigm from "need to know" to "need to share." In order to help information consumers make the best decisions possible, it is imperative to formulate concepts, models, frameworks, architectures, and mechanisms to facilitate information trustworthiness management in distributed and decentralized environments. In this paper we initiate a study in this direction by proposing an abstraction called information networks as well as two supporting mechanisms called provenance digital signatures and optimal security hardening of information networks.

Shouhuai Xu, Haifeng Qian, Fengying Wang, Zhenxin Zhan, Elisa Bertino, Ravi Sandhu

Web Data II

How to Design Kansei Retrieval Systems?

All information retrieval (IR) systems try to retrieve the information that users really want. For this purpose, users have to express and submit their requirements to the system exactly. However, how to reflect a user's subjectivity is a hard problem. When a user wants to search for something, for instance a passenger car or a costume, he or she normally has a kind of feeling such as "graceful and looks intelligent, but not so expensive." Many researchers call this feeling "Kansei", which covers the user's psychological feeling as well as physiological issues. Consider another example: image retrieval systems that can handle subjective expressions are useful especially when users, who have no knowledge about the contents of the image database, try to use Kansei words (e.g., "beautiful", "calm", etc.) to retrieve unknown images. Unfortunately, traditional information retrieval systems cannot efficiently deal with user-given search requests containing Kansei words. Thus, in many application systems, how to deal with Kansei words from different users has become an important issue that system designers have to confront. Against this background, "Kansei engineering" has become one of the hot topics in the IR field, and many Kansei retrieval systems have been implemented. However, the existing Kansei retrieval systems all target specific applications and are thus very different from each other. In this paper, based on extensive investigation and analysis of the existing systems, we propose a general method for designing efficient Kansei retrieval systems, which has not been done in this field. Our proposal mainly includes a general flow for designing Kansei retrieval systems and a discussion on how to speed up Kansei retrieval processes using multidimensional indexes. We hope that this paper will help designers who are planning to build Kansei retrieval systems.

Yaokai Feng, Seiichi Uchida
Detecting Hot Events from Web Search Logs

Detecting events from web resources is a challenging task that has attracted much attention in recent years. Web search logs are an important data source for event detection because the information they contain reflects users' activities and interest in various real-world events. There are three major issues for event detection from web search logs: effectiveness, efficiency and the organization of detected events. In this paper, we develop a novel Topic and Event Detection method, TED, to address these issues. We first divide the whole data into topics for efficiency, and then incorporate link information, temporal information and query content to ensure the quality of detected events. Finally, the detected events are organized through the proposed interestingness measure as well as the topics they belong to. Experiments are conducted on a commercial search engine log. The results demonstrate that our method can effectively and efficiently detect hot events and give a meaningful organization of them.

Yingqin Gu, Jianwei Cui, Hongyan Liu, Xuan Jiang, Jun He, Xiaoyong Du, Zhixu Li
Evaluating Truthfulness of Modifiers Attached to Web Entity Names

To make online advertisements or user-generated content more attractive, people often use modifiers such as "authentic," "impressive," "special," and so on. Some of these are exaggerations; that is, the modifiers attached to Web entities sometimes do not represent the content appropriately. In this paper, we propose a method to evaluate the truthfulness of modifiers attached to Web entity names by extracting relevant and conflicting terms from the content texts.

Ryohei Takahashi, Satoshi Oyama, Hiroaki Ohshima, Katsumi Tanaka
Searching the Web for Alternative Answers to Questions on WebQA Sites

We propose a method of discovering, from the Web, alternative answers to a question posted on a Web question & answer (Q&A) site that differ from the existing answers to that question on the Q&A site. Our method first automatically generates queries for conventional Web search engines to collect Web pages that can contain answers to the target question. Each collected Web page is evaluated by calculating two kinds of scores: one represents the probability that the page has information that answers a question in the Q&A content, and the other represents the possibility that it has an alternative answer. The method is implemented and the system is evaluated using actual Q&A content.

Natsuki Takata, Hiroaki Ohshima, Satoshi Oyama, Katsumi Tanaka
Domain-Independent Classification for Deep Web Interfaces

The data sources of the Deep Web provide tremendous amounts of structured data of high quality. However, classifying their query interfaces in a domain-independent manner is required, since the domains of the huge-scale deep Web are hard to predefine. In this paper, we propose a novel three-stage approach to solve this problem. First, we extract both the text and the structure of a query interface by applying the FIE algorithm we propose. Then we construct frequent itemsets by using a frequent pattern mining algorithm. Finally, we apply the AP clustering algorithm to cluster the frequent itemsets according to the similarity measure FGSTD presented in this paper. The experiments demonstrate that our approach clusters interfaces well in a domain-independent way.

Yingjun Li, Siwei Wang, Derong Shen, Tiezheng Nie, Ge Yu

Data Mining II

Data Selection for Exact Value Acquisition to Improve Uncertain Clustering

In recent years, data uncertainty has attracted wide attention from researchers because the amount of imprecise data is growing rapidly. Although data are not known exactly, probability distributions or expected errors are sometimes available. While most research on uncertain data mining looks for methods to extract mining results from uncertain data, usually given in the form of probability distributions or expected errors, it is also very important to lower the data uncertainty by making part of the data more certain, so as to obtain better mining results. For example, the input values of some sensors in a sensor network are usually designed to be recorded more frequently than others because they are more important or more likely to change. In this paper, the issue of selecting a part of the uncertain data and acquiring their exact values to improve clustering results is explored. Under a general uncertainty model, we propose both global and localized data selection methods, which can be used together with any existing uncertain clustering algorithm. Experimental results show that the quality of clustering improves after the selective exact value acquisition is applied.

Yu-Chieh Lin, De-Nian Yang, Ming-Syan Chen
Exploring the Sentiment Strength of User Reviews

Existing research efforts in sentiment analysis of online user reviews mainly focus on extracting features (such as quality and price) of products/services and classifying users' sentiments into semantic orientations (such as positive, negative or neutral). However, few of them take the strength of user sentiments into consideration, which is particularly important in measuring the overall quality of products/services. Intuitively, different reviews of the same feature can have quite different sentiment strength, even though they may express the same polarity of sentiment. This paper presents an approach to estimating the sentiment strength of user reviews according to the strength of the adverbs and adjectives used in their opinion phrases. Experimental results on a Chinese hotel review dataset show that the proposed approach is effective in the task of sentiment classification and achieves good performance in a multi-scale evaluation.
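A toy illustration of the adverb-adjective strength idea follows (the lexicon scores and scaling rule are hypothetical, not the paper's model): an opinion phrase's strength is the adjective's polarity scaled by the modifying adverb, and a review's strength aggregates its phrases.

```python
# Hypothetical lexicons: adjective polarity in [-1, 1], adverb intensity multipliers.
ADJECTIVES = {"clean": 0.6, "comfortable": 0.7, "noisy": -0.6, "dirty": -0.8}
ADVERBS = {"extremely": 1.8, "very": 1.5, "fairly": 1.0, "slightly": 0.5}

def phrase_strength(adverb, adjective):
    """Sentiment strength of an opinion phrase = adverb weight * adjective score."""
    return ADVERBS.get(adverb, 1.0) * ADJECTIVES.get(adjective, 0.0)

def review_strength(phrases):
    """Average the phrase strengths into a review-level score."""
    scores = [phrase_strength(adv, adj) for adv, adj in phrases]
    return sum(scores) / len(scores) if scores else 0.0

print(review_strength([("very", "clean"), ("slightly", "noisy")]))  # mildly positive
```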

Yao Lu, Xiangfei Kong, Xiaojun Quan, Wenyin Liu, Yinlong Xu
Semantic Entity Detection by Integrating CRF and SVM

Semantic entity detection is very important for extracting and representing the abundant semantic information in multimedia documents. In comparison with other media, e.g. video, image and audio, text expresses semantics more directly and often serves as a bridge in cross-media analysis. However, semantic entity detection from text is still a difficult problem because of the complexity of natural language. In this paper, we propose a novel framework which takes advantage of both CRFs (conditional random fields) and SVMs (support vector machines), and present its application to semantic entity detection. In this framework, context features are represented as entity-boundary probabilities and extracted via CRFs, while linguistic and statistical features are extracted via large-scale text document analysis. Finally, all extracted features are integrated and used to perform the classification. As our algorithm systematically integrates context, linguistic and statistical features, it may outperform traditional algorithms that adopt only part of these features.

Peng Cai, Hangzai Luo, Aoying Zhou
An Incremental Method for Causal Network Construction

We propose a novel method of constructing causal networks to clarify the relationships among events. Since events and their relationships continually change, our method works in an incremental manner. There are two problems in conventional methods of constructing causal networks that use keywords representing events: 1) detecting similar events for network construction is a time-consuming task, and 2) because the merge operation depends on the order in which events appear, there is a consistency issue in incremental construction. In this paper, as the representation model, we propose a Topic-Event Causal network model (TEC model) in which the topic and the details of an event are represented separately using structured keywords. We cluster events by the topic keyword and then detect similar events within each cluster, which reduces the number of event comparisons. When we compute the similarity of two events in a topic, we compare the pair of SVO tuples of vertices based on WordNet, thereby resolving differences in word usage and lexical ambiguity while keeping the causal network consistent. We also show experimental results that demonstrate its usefulness.

Hiroshi Ishii, Qiang Ma, Masatoshi Yoshikawa
DCUBE: CUBE on Dirty Databases

In real-world databases, dirty data such as inconsistent data and duplicate data affect the effectiveness of database applications. This brings new challenges to efficiently processing OLAP on databases with dirty data. CUBE is an important operator for OLAP. This paper proposes a CUBE operation based on overlapping clustering, along with an effective and efficient storage and computation method for CUBE on databases with dirty data. Based on this CUBE, the paper proposes efficient algorithms for answering aggregation queries, as well as processing methods for the other major OLAP operators on databases with dirty data. Experimental results show the efficiency of the algorithms presented in this paper.

Guohua Jiang, Hongzhi Wang, Shouxu Jiang, Jianzhong Li, Hong Gao

XML and Images

An Algorithm for Incremental Maintenance of Materialized XPath View

Cached materialized views have many benefits, but they suffer from inconsistency when the data source is modified. This paper proposes an incremental maintenance algorithm, IMA, to implement incremental maintenance of materialized XPath views. The characteristics of IMA are: (1) it can establish an incremental maintenance program automatically; (2) the incremental maintenance program is simple, and its function is to reassemble the materialized XPath view to obtain a consistent view; (3) the incremental maintenance program is written in the XQuery language and can be executed by any XQuery engine; (4) the only auxiliary data used to maintain the materialized XPath view is the incremental maintenance program itself, whose size cost is very small. Experiments have shown that this approach outperforms view recomputation.

Xueyun Jin, Husheng Liao
Query Processing in INM Database System

Real-world objects have various natural and sometimes complex relationships with each other. Through these relationships, they play various roles and acquire the corresponding context-dependent properties. To model these situations naturally and directly, we have proposed a novel data model called the Information Networking Model (INM). We have also designed an information specification language (ISL), an information manipulation language (IML) and an information query language (IQL), and implemented a database management system based on INM. The query language can explore the natural graph structure of INM objects to extract meaningful information in a simple, concise, and compact way. In this paper, we focus on query processing in the INM database system. We use four kinds of search strategies, and the system automatically selects an appropriate strategy to process queries efficiently without user intervention.

Jie Hu, Qingchuan Fu, Mengchi Liu
Fragile Watermarking for Color Image Recovery Based on Color Filter Array Interpolation

In this paper, a novel fragile watermarking method is proposed for color image recovery via color filter array interpolation. First, we turn a color image into a grey image by color sampling in the form of a Bayer pattern. The grey-image data, reordered as reference bits, are then embedded into the original image along with some authentication bits generated from the image. After the watermarked image is received, the authentication bits are extracted to identify tampered regions. When tampering is detected, the extracted reference bits are used to reconstruct a color image by color filter array interpolation, which is then used to recover the content of the tampered regions. Experiments show that this method has good recovery ability.

Zhenxing Qian, Guorui Feng, Yanli Ren
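
The first step described above, sampling a color image into a single grey channel following a Bayer pattern, can be illustrated with a few lines of NumPy; this is a toy sketch assuming an RGGB layout, not the authors' exact procedure.

```python
# Sketch: turn an RGB image into a single-channel image by sampling colour
# values in a Bayer (RGGB) pattern.
import numpy as np

def bayer_sample(rgb):
    """rgb: H x W x 3 array. Returns an H x W grey image sampled as RGGB."""
    h, w, _ = rgb.shape
    grey = np.empty((h, w), dtype=rgb.dtype)
    grey[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R at even rows, even cols
    grey[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G at even rows, odd cols
    grey[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G at odd rows, even cols
    grey[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd cols
    return grey

img = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
print(bayer_sample(img))
```
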
A Hybrid-Feature-Based Efficient Retrieval over Chinese Calligraphic Manuscript Image Repository

In this paper, we propose an interactive indexing scheme to facilitate efficient search of Chinese calligraphic manuscript images based on hybrid features such as contour points and number of strokes. Different from conventional character retrieval and indexing methods [6], our indexing algorithm allows users to choose another feature they prefer, such as the stroke of a character, as the query element. To better facilitate efficient and unified retrieval, we then propose a Hybrid-Feature-Tree (HF-Tree)-based high-dimensional indexing scheme. Comprehensive experiments are conducted to verify its effectiveness and efficiency.

Yi Zhuang, Chengxiang Yuan
Efficient Filtering of XML Documents with XPath Expressions Containing Ancestor Axis

In this paper, we address the problem of filtering XML documents against a large number of XPath expressions whose predicates contain the axes ‘ancestor’, ‘descendant’ and ‘child’. We propose a novel index structure, called NIndex, to index these complex XPath expressions. Based on NIndex, we propose a new filtering algorithm with lower complexity for this problem. Our experimental results show that the algorithm performs well across a range of XPath expressions and documents.

Bo Ning, Chengfei Liu, Guoren Wang

New Hardware

ACAR: An Adaptive Cost Aware Cache Replacement Approach for Flash Memory

Flash memory has been gaining popularity as a substitute for magnetic disks. However, due to its asymmetric IO latency, cache management policy needs to be reconsidered in systems equipped with flash. A novel buffer replacement approach named ACAR, which stands for Adaptive Cost Aware cache Replacement, is proposed in this paper to address this problem. Taking operation cost into consideration, ACAR allocates separate pools for clean and dirty pages. In addition, the pool sizes are tuned dynamically as the IO pattern evolves. Furthermore, hot-data recognition is added in an enhanced version of ACAR. Experiments with artificial and real IO traces demonstrate that ACAR outperforms state-of-the-art cache replacement strategies.

Yanfei Lv, Xuexuan Chen, Bin Cui
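
A much-simplified sketch of the two-pool idea: keep clean and dirty pages in separate LRU pools and evict clean pages first, since writing a dirty page back to flash is far more expensive than re-reading a clean one. The real ACAR also adapts the pool sizes and recognizes hot data; neither is reproduced here.

```python
# Sketch: cost-aware buffer with separate LRU pools for clean and dirty pages.
from collections import OrderedDict

class TwoPoolCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.clean = OrderedDict()   # page_id -> data, LRU order
        self.dirty = OrderedDict()

    def _evict(self):
        # Prefer the cheap victim: the LRU clean page if any, else LRU dirty.
        pool = self.clean if self.clean else self.dirty
        victim, _ = pool.popitem(last=False)
        return victim

    def access(self, page_id, data=None, write=False):
        for pool in (self.clean, self.dirty):
            if page_id in pool:
                value = pool.pop(page_id)
                target = self.dirty if (write or pool is self.dirty) else self.clean
                target[page_id] = value          # re-insert at MRU position
                return
        if len(self.clean) + len(self.dirty) >= self.capacity:
            self._evict()
        (self.dirty if write else self.clean)[page_id] = data

cache = TwoPoolCache(capacity=3)
for pid, w in [(1, False), (2, True), (3, False), (4, False)]:
    cache.access(pid, data=f"page{pid}", write=w)
print(list(cache.clean), list(cache.dirty))   # [3, 4] [2]
```
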
GPU-Accelerated Predicate Evaluation on Column Store

Column scan, or predicate evaluation and filtering over a column of data in a database table, is an important primitive for data mining and data warehousing. In this paper, we present our study on accelerating column scan using a massively parallel accelerator. With a design that takes full advantage of the memory hierarchy and parallel execution in such processors, we achieve very attractive speedups that significantly exceed previously reported results, making the use of such an accelerator for this type of primitive much more viable. Running on a general-purpose graphics processing unit (GPGPU), an NVidia GTX 280, the GPU version is about 5-6 times faster than an implementation on an eight-core CPU, and over 40 times faster than one on a single-core CPU.

Ren Wu, Bin Zhang, Meichun Hsu, Qiming Chen
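
The primitive itself is simple: evaluate a predicate over every value of a column and return the matching row positions. The sketch below shows it as a vectorised NumPy scan; the paper maps the same data-parallel pattern onto a GPU, which is not reproduced here.

```python
# Sketch: the column-scan primitive (predicate evaluation over a column).
import numpy as np

column = np.random.randint(0, 1000, size=1_000_000)

def column_scan(col, lo, hi):
    """Return the indices of rows whose value falls in [lo, hi)."""
    mask = (col >= lo) & (col < hi)       # one predicate test per element,
    return np.nonzero(mask)[0]            # trivially parallel across rows

print(len(column_scan(column, 100, 200)))
```
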
MOSS-DB: A Hardware-Aware OLAP Database

Data-intensive analytical workloads place a heavy burden on OLAP engines as data volume, user population and query complexity grow, while large-capacity RAM, multi-level caches and multi-core processors have become mainstream hardware. We propose a hardware-aware OLAP model named MOSS-DB, which optimizes the storage model according to the data access characteristics of dimension tables and fact tables. A two-level hard disk and main memory storage model supports the directly dimensional tuple accessing join operator (DDTA-JOIN), which simplifies OLAP query processing by replacing the traditional join operation with direct access to dimension tuples through memory addresses. The star schema can thus be treated as a virtual de-normalized table, and an OLAP query is reduced to table scan, select and project operations. Query processing over such sequential data structures is also well suited to multi-core parallel processing. Our proposal allows massive-data DRDB (Disk Resident Database) storage techniques to cooperate with MMDB (Main-Memory Database) processing techniques, which breaks the main-memory capacity limitation. The DDTA-JOIN operation saves the cost of indexes, hash tables, etc. In the multi-core era, MOSS-DB can flexibly use the parallel processing capability of the CPU by dynamically dividing the fact table into multiple scan partitions, gaining maximum cache benefit from the shared dimension data. Our experiments show that MOSS-DB outperforms a conventional DRDB system and also outperforms an MMDB in SSB tests.

Yansong Zhang, Wei Hu, Shan Wang
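
The core intuition behind DDTA-JOIN can be shown in a few lines: if the fact table stores the array position of each dimension row instead of a foreign key that must be joined, a star query becomes a single scan with direct lookups into memory-resident dimension columns. The sketch below is illustrative only; the column names and data are invented.

```python
# Sketch: star query as a scan with direct dimension-tuple access (no join).
import numpy as np

# memory-resident dimension column (array position == surrogate key)
dim_region = np.array(["asia", "europe", "america"])

# fact table columns, here simulated as in-memory arrays
fact_region_idx = np.array([0, 2, 1, 0, 2, 2])     # positions into dim_region
fact_revenue    = np.array([10, 25, 7, 3, 40, 5])

# SELECT SUM(revenue) WHERE region = 'america'  -- just a scan with lookups
mask = dim_region[fact_region_idx] == "america"    # direct tuple access
print(fact_revenue[mask].sum())                    # -> 70
```
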

Similarity Search

Efficient Duplicate Record Detection Based on Similarity Estimation

In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of the matching. However, this intuitive method has two shortcomings: in efficiency, it needs to compare all records pairwise; in effectiveness, a strict duplicate-judgment condition results in low recall. To make the method work in practice, an efficient method based on similarity estimation is presented in this paper. The basic idea is to estimate the range of the similarity between two records in O(1) time, and to decide whether they are duplicates according to the estimate. Theoretical analysis and experimental results show that the method is both effective and efficient.

Mohan Li, Hongzhi Wang, Jianzhong Li, Hong Gao
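
The baseline the paper speeds up, record similarity as the weight of an optimal bipartite matching between attribute values, might look like the sketch below; the paper's O(1) similarity estimation is not reproduced, and the attribute similarity function is a stand-in.

```python
# Sketch: record similarity via max-weight bipartite matching of attributes.
from difflib import SequenceMatcher
import numpy as np
from scipy.optimize import linear_sum_assignment

def attr_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def record_similarity(rec1, rec2):
    """Max-weight matching between the attribute values of two records."""
    cost = np.array([[-attr_sim(a, b) for b in rec2] for a in rec1])
    rows, cols = linear_sum_assignment(cost)    # minimise cost = maximise sim
    return -cost[rows, cols].sum() / max(len(rec1), len(rec2))

r1 = ["John Smith", "NYC", "Engineer"]
r2 = ["Smith, John", "New York City", "engineer"]
print(record_similarity(r1, r2))
```
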
A Novel Composite Kernel for Finding Similar Questions in CQA Services

Finding similar questions in Community Question Answering (CQA) services plays an increasingly important role in current web and IR applications. The task aims to retrieve historical questions that are similar or relevant to new questions posed by users. However, traditional “bag-of-words” models fail to measure the similarity between question sentences, as they usually ignore sequential and syntactic information. In this paper, we propose a novel composite kernel to improve the accuracy of question matching. Our study illustrates that the composite kernel can efficiently capture both lexical semantics and syntactic information in a question sentence by leveraging a word sequence kernel, a POS tag sequence kernel and a syntactic tree kernel. Experimental results on real-world datasets show that our proposed method significantly outperforms state-of-the-art models.

Jun Wang, Zhoujun Li, Xia Hu, Biyun Hu
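
A composite kernel is typically a weighted combination of simpler kernels. The sketch below combines a word n-gram overlap kernel with a POS-tag n-gram overlap kernel and plugs the resulting Gram matrix into an SVM; the paper's actual kernels (word sequence, POS sequence, syntactic tree) are richer, so this only illustrates the composition, and the toy questions and tags are invented.

```python
# Sketch: composite kernel = weighted sum of two simple overlap kernels.
import numpy as np
from sklearn.svm import SVC

def ngram_overlap_kernel(seq_a, seq_b, n=2):
    grams_a = {tuple(seq_a[i:i+n]) for i in range(len(seq_a) - n + 1)}
    grams_b = {tuple(seq_b[i:i+n]) for i in range(len(seq_b) - n + 1)}
    return len(grams_a & grams_b)

def composite_gram(words, tags, alpha=0.6):
    n = len(words)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = (alpha * ngram_overlap_kernel(words[i], words[j])
                       + (1 - alpha) * ngram_overlap_kernel(tags[i], tags[j]))
    return K

words = [["how", "to", "reset", "password"], ["ways", "to", "reset", "password"],
         ["best", "pizza", "in", "town"], ["where", "to", "eat", "pizza"]]
tags  = [["WRB", "TO", "VB", "NN"], ["NNS", "TO", "VB", "NN"],
         ["JJS", "NN", "IN", "NN"], ["WRB", "TO", "VB", "NN"]]
labels = [1, 1, 0, 0]

K = composite_gram(words, tags)
clf = SVC(kernel="precomputed").fit(K, labels)
print(clf.predict(K))
```
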
Efficient Similarity Query in RFID Trajectory Databases

Similarity query is one of the most important operations in trajectory databases, and this paper addresses the problem in RFID trajectory databases. Due to the special sensing manner, the model of an RFID trajectory differs from that of a traditional trajectory, making existing techniques inefficient. This paper proposes a novel distance function, EDEU (Edit Distance combined with Euclidean Distance), which supports local time shifting and takes the distance between adjacent logic areas into consideration. We also develop two filter-refinement mechanisms, based on Co-occurrence Degree and Length Dispersion Ratio, to improve the efficiency of the similarity analysis. Furthermore, we extend our solution to determine local similarity within globally dissimilar trajectory pairs. Extensive experiments verify the efficiency of our proposed algorithms.

Yanqiu Wang, Ge Yu, Yu Gu, Dejun Yue, Tiancheng Zhang
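
One plausible reading of an edit distance combined with Euclidean distance is sketched below: the substitution cost between two logic areas is their normalised Euclidean distance, so nearby areas are cheap to substitute. This is not the authors' exact definition of EDEU, and the area coordinates are invented.

```python
# Sketch: edit distance over area sequences with Euclidean substitution cost.
import math

AREA_POS = {"A": (0, 0), "B": (1, 0), "C": (5, 5), "D": (6, 5)}
MAX_DIST = max(math.dist(p, q) for p in AREA_POS.values() for q in AREA_POS.values())

def sub_cost(a, b):
    return math.dist(AREA_POS[a], AREA_POS[b]) / MAX_DIST

def edeu(t1, t2):
    """Dynamic-programming edit distance with Euclidean substitution cost."""
    m, n = len(t1), len(t2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1): d[i][0] = i
    for j in range(1, n + 1): d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,              # deletion
                          d[i][j - 1] + 1,              # insertion
                          d[i - 1][j - 1] + sub_cost(t1[i - 1], t2[j - 1]))
    return d[m][n]

print(edeu(["A", "B", "C"], ["B", "C", "D"]))
```
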

Information Extraction

Context-Aware Basic Level Concepts Detection in Folksonomies

This paper deals with the problem of exploring implicit semantics in folksonomies, where users create and manage tags to annotate web resources. The collection of user-created tags in folksonomies is a potential source of semantics. Much research has been done to extract concepts, and even concept hierarchies (ontologies), which are important components of knowledge representation (e.g. in the semantic web and agent communication), from folksonomies. However, there has been no metric for discovering concepts that humans find acceptable and agreeable, and thus many concepts extracted from folksonomies by existing approaches are not natural for human use. In cognitive psychology, there is a family of concepts named basic level concepts which are frequently used by people in daily life, and most human knowledge is organized around them. Thus, extracting basic level concepts from folksonomies is more meaningful for categorizing and organizing web resources than extracting concepts at other granularities. In addition, context plays an important role in basic level concept detection, as the basic level concepts of the same domain differ across contexts. In this paper, we propose a method to detect basic level concepts in different contexts from folksonomies. Using the Open Directory Project (ODP) as the benchmark, we demonstrate the existence of the context effect and the effectiveness of our method.

Wen-hao Chen, Yi Cai, Ho-fung Leung, Qing Li
Extracting 5W1H Event Semantic Elements from Chinese Online News

This paper proposes a verb-driven approach to extract 5W1H (Who, What, Whom, When, Where and How) event semantic information from Chinese online news. The main contributions of our work are two-fold: first, given the usual structure of a news story, we propose a novel algorithm to extract topic sentences by stressing the importance of the news headline; second, we extract event facts (i.e. 5W1H) from these topic sentences by applying a rule-based (verb-driven) method and a supervised machine-learning method (SVM). This method significantly improves on the predicate-argument structure used in the Automatic Content Extraction (ACE) Event Extraction (EE) task by considering the valency (the capacity to dominate noun phrases) of a Chinese verb. Extensive experiments on the ACE 2005 datasets confirm its effectiveness, and it also shows very high scalability, since we consider only topic sentences and surface text features. Based on this method, we build a prototype system named Chinese News Fact Extractor (CNFE). CNFE is evaluated on a real-world corpus containing 30,000 newspaper documents. Experimental results show that CNFE can extract event facts efficiently.

Wei Wang, Dongyan Zhao, Lei Zou, Dong Wang, Weiguo Zheng
Automatic Domain Terminology Extraction Using Graph Mutual Reinforcement

Information Extraction (IE) aims at mining knowledge from unstructured data, and terminology extraction is one of its crucial subtasks. In this paper, we propose a novel approach to domain terminology extraction based on ranking, exploiting the links among authors, papers and conferences in domain proceedings. Candidate terms are extracted by statistical methods and then ranked by importance values derived from mutual reinforcement over the author-paper-conference graph. Furthermore, we integrate our approach with several classical termhood-based methods, including C-value and inverse document frequency. The presented approach does not require any training data and can be extended to other domains. Experimental results show that our approach outperforms several competitive methods.

Jingjing Kang, Xiaoyong Du, Tao Liu, He Hu
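
Graph mutual reinforcement generally means iterating score propagation between linked node types until the scores stabilise, in the spirit of HITS. The sketch below does this over toy author-paper and paper-conference incidence matrices; it is only an illustration of the iteration, the matrices are invented, and the step that derives term importance from paper scores is omitted.

```python
# Sketch: mutual reinforcement over an author-paper-conference graph.
import numpy as np

def normalise(M):
    col_sums = M.sum(axis=0, keepdims=True)
    return M / np.where(col_sums == 0, 1, col_sums)

# rows = papers, cols = authors / conferences (toy incidence matrices)
paper_author = normalise(np.array([[1, 0], [1, 1], [0, 1]], dtype=float))
paper_conf   = normalise(np.array([[1, 0], [0, 1], [0, 1]], dtype=float))

papers = np.ones(3) / 3
for _ in range(50):                        # iterate until (near) convergence
    authors = normalise(paper_author.T @ papers[:, None]).ravel()
    confs   = normalise(paper_conf.T   @ papers[:, None]).ravel()
    papers  = paper_author @ authors + paper_conf @ confs
    papers /= papers.sum()                 # keep scores on a common scale

print(papers, authors, confs)
```
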

Knowledge Discovery

Semi-supervised Learning from Only Positive and Unlabeled Data Using Entropy

The problem of classification from positive and unlabeled examples currently attracts much attention. However, when the number of unlabeled negative examples is very small, the effectiveness of previous work decreases. This paper proposes an effective approach to address this problem: we first use entropy to select the likely positive and negative examples to build a complete training set, and then apply a logistic regression classifier on this new training set for classification. A series of experiments shows that the proposed approach outperforms previous work in the literature.

Xiaoling Wang, Zhen Xu, Chaofeng Sha, Martin Ester, Aoying Zhou
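
One way to realise the two-step idea is sketched below: score the unlabeled examples with an initial positive-vs-unlabeled model, use prediction entropy to pick the confident ("likely") positives and negatives, and train the final logistic regression on that completed set. The thresholds and data are invented, not the paper's exact procedure.

```python
# Sketch: entropy-based selection of likely examples for PU learning.
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def pu_train(X_pos, X_unlabeled, ent_threshold=0.3):
    # Step 1: treat unlabeled as negative to get rough scores.
    X0 = np.vstack([X_pos, X_unlabeled])
    y0 = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unlabeled))]
    rough = LogisticRegression(max_iter=1000).fit(X0, y0)
    p = rough.predict_proba(X_unlabeled)[:, 1]

    # Step 2: keep only low-entropy (confident) unlabeled examples.
    confident = entropy(p) < ent_threshold
    X1 = np.vstack([X_pos, X_unlabeled[confident]])
    y1 = np.r_[np.ones(len(X_pos)), (p[confident] > 0.5).astype(float)]
    return LogisticRegression(max_iter=1000).fit(X1, y1)

rng = np.random.default_rng(1)
X_pos = rng.normal(2.0, 1.0, size=(30, 2))
X_unl = rng.normal(0.0, 1.5, size=(200, 2))
model = pu_train(X_pos, X_unl)
print(model.predict(np.array([[2.5, 2.5], [-2.0, -2.0]])))
```
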
Margin Based Sample Weighting for Stable Feature Selection

Stability of feature selection is an important issue in knowledge discovery from high-dimensional data. A key factor affecting the stability of a feature selection algorithm is the sample size of training set. To alleviate the problem of small sample size in high-dimensional data, we propose a novel framework of margin based sample weighting which extensively explores the available samples. Specifically, it exploits the discrepancy among local profiles of feature importance at various samples and weights a sample according to the outlying degree of its local profile of feature importance. We also develop an efficient algorithm under the framework. Experiments on a set of public microarray datasets demonstrate that the proposed algorithm is effective at improving the stability of state-of-the-art feature selection algorithms, while maintaining comparable classification accuracy on selected features.

Yue Han, Lei Yu
Associative Classifier for Uncertain Data

Associative classifiers are relatively easy for people to understand and often outperform decision tree learners on many classification problems. Existing associative classifiers only work with certain data. However, data uncertainty is prevalent in many real-world applications such as sensor networks, market analysis and medical diagnosis, and this uncertainty may render many conventional classifiers inapplicable to uncertain classification tasks. In this paper, based on the U-Apriori and CBA algorithms, we propose an associative classifier for uncertain data, uCBA (uncertain Classification Based on Associative), which can classify both certain and uncertain data. The algorithm redefines the support, confidence, rule pruning and classification strategy of CBA. Experimental results on 21 datasets from the UCI Repository demonstrate that the proposed algorithm performs well even on highly uncertain data.

Xiangju Qin, Yang Zhang, Xue Li, Yong Wang
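
The notion that U-Apriori-style mining rests on, and that a redefined support for uncertain data typically starts from, is expected support: each transaction assigns an existence probability to each item, and the expected support of an itemset is the sum over transactions of the product of its items' probabilities. A minimal sketch with invented data:

```python
# Sketch: expected support of an itemset over uncertain transactions.
from math import prod

transactions = [
    {"fever": 0.9, "cough": 0.7},
    {"fever": 0.4, "headache": 0.8},
    {"fever": 0.95, "cough": 0.85, "headache": 0.1},
]

def expected_support(itemset, db):
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in db)

print(expected_support({"fever"}, transactions))            # 0.9 + 0.4 + 0.95
print(expected_support({"fever", "cough"}, transactions))   # 0.63 + 0 + 0.8075
```
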

Information Integration

Automatic Multi-schema Integration Based on User Preference

Schema integration plays a central role in numerous database applications, such as Deep Web, DataSpaces and Ontology Merging. Although there has been much research on schema integration, it neglects user preference, which is a very important factor for improving the quality of mediated schemas. In this paper, we propose automatic multi-schema integration based on user preference. A new concept named reference schema is introduced to represent user preference and guide the integration process toward mediated schemas that match it. Different from previous solutions, our approach employs the F-measure and “attribute density” to measure the similarity between schemas. Based on this similarity, we design a top-k ranking algorithm that retrieves the k mediated schemas users really expect. The key component of the algorithm is a pruning strategy that uses divide and conquer to narrow down the search space of candidate schemas. Finally, an experimental study demonstrates the effectiveness and good performance of our approach.

Guohui Ding, Guoren Wang, Junchang Xin, Huichao Geng
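
An F-measure-style similarity between a candidate mediated schema and the reference schema can be sketched as precision/recall over matched attributes, as below. The matching step is reduced here to exact name equality and the paper's "attribute density" component is omitted, so this is only an illustration of the ranking idea.

```python
# Sketch: rank candidate mediated schemas by F-measure against a reference.
def f_measure_similarity(candidate_attrs, reference_attrs):
    matched = set(candidate_attrs) & set(reference_attrs)
    if not matched:
        return 0.0
    precision = len(matched) / len(candidate_attrs)
    recall = len(matched) / len(reference_attrs)
    return 2 * precision * recall / (precision + recall)

reference = ["title", "author", "year", "price"]          # user preference
candidates = {
    "s1": ["title", "author", "publisher"],
    "s2": ["title", "author", "year", "price", "isbn"],
}
ranked = sorted(candidates,
                key=lambda s: f_measure_similarity(candidates[s], reference),
                reverse=True)
print(ranked)
```
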
EIF: A Framework of Effective Entity Identification

Entity identification, that is, establishing the correspondence between objects and entities in dirty data, plays an important role in data cleaning. The confusion between entities and their names often results in dirty data: different entities may share an identical name, and different names may refer to the identical entity. Therefore, the major task of entity identification is to distinguish entities sharing the same name and to recognize different names referring to the same entity. However, current research focuses on only one aspect and cannot solve the problem completely. To address this, we propose EIF, a framework of entity identification that considers both kinds of confusion. With effective clustering techniques, approximate string matching algorithms and a flexible mechanism of knowledge integration, EIF can be widely used to solve many different kinds of entity identification problems. As an application of EIF, we solve the author identification problem in this paper. The effectiveness of the framework is verified by extensive experiments.

Lingli Li, Hongzhi Wang, Hong Gao, Jianzhong Li
A Multilevel and Domain-Independent Duplicate Detection Model for Scientific Database

Duplicate detection is one of the technical difficulties in data cleaning. At present, the data volume of scientific databases is increasing rapidly, bringing new challenges to duplicate detection. In a scientific database, the duplicate detection model should be suitable for massive and numerical data, should be independent of specific domains, should take the relationships among tables into account, and should focus on the common characteristics of scientific databases. In this paper, a multilevel duplicate detection model for scientific databases is proposed, which handles numerical data well and is designed for general use. First, the challenges are identified by analyzing the duplicate-related characteristics of scientific data; second, the similarity measures of the proposed model are defined; then the details of the multilevel detection algorithms are introduced; finally, experiments and applications show that the proposed model is domain-independent, effective, and suitable for duplicate detection in scientific databases.

Jie Song, Yubin Bao, Ge Yu

Extending Databases

Generalized UDF for Analytics Inside Database Engine

Running analytics computation inside a database engine through the use of UDFs (User Defined Functions) has been investigated, but it has not yet become a scalable approach due to several technical limitations. One limitation lies in the lack of generality for UDFs to express complex applications and to compose them with relational operators in SQL queries. Another lies in the lack of systematic support for a UDF to cache relations initially for efficient computation across multiple calls. Further, making UDF execution interact efficiently with query processing requires detailed system programming, which is often beyond the expertise of most application developers.

To solve these problems, we extend the UDF technology in both the semantic and the system dimensions. We generalize UDFs to support scalar, tuple and relation input and output, allow UDFs to be defined on the entire content of relations, and allow moderate-sized input relations to be cached initially to avoid repeated retrieval. With this extension, the generalized UDFs can be composed with other relational operators and thus integrated into queries naturally. Furthermore, based on the notion of invocation patterns, we provide focused system support for efficient interaction between UDF execution and query processing.

We have taken the open-source PostgreSQL engine and a commercial, proprietary parallel database engine as our prototyping vehicles, and we illustrate the performance, modeling power and usability of the proposed approach with experimental results on both platforms.

Meichun Hsu, Qiming Chen, Ren Wu, Bin Zhang, Hans Zeller
Efficient Continuous Top-k Keyword Search in Relational Databases

Keyword search in relational databases has been widely studied in recent years. Most previous studies focus on how to answer an instant keyword query. In this paper, we focus on how to efficiently find the top-k answers in relational databases for continuous keyword queries. As answering a keyword query involves a large number of join operations between relations, re-evaluating the keyword query whenever the database is updated is rather expensive. We propose a method that computes a range for the future relevance score of each query answer. For each keyword query, our method computes a state of the query evaluation process, which contains only a small amount of data and can be used to maintain the top-k answers while the database continually grows. The experimental results show that our method can respond to continuous keyword searches over a relational database that is updated frequently.

Yanwei Xu, Yoshiharu Ishikawa, Jihong Guan
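
A simplified stand-in for the maintenance idea: keep the current top-k answers in a small heap and do further work only when a newly arrived (or re-scored) answer can beat the current k-th score. The paper's state-based score ranges are not reproduced; the class and data below are illustrative.

```python
# Sketch: maintain top-k answers incrementally with a min-heap.
import heapq

class TopK:
    def __init__(self, k):
        self.k = k
        self.heap = []                    # min-heap of (score, answer_id)

    def kth_score(self):
        return self.heap[0][0] if len(self.heap) >= self.k else float("-inf")

    def offer(self, answer_id, score):
        if score <= self.kth_score():
            return False                  # cannot enter the top-k, skip work
        heapq.heappush(self.heap, (score, answer_id))
        if len(self.heap) > self.k:
            heapq.heappop(self.heap)
        return True

topk = TopK(k=2)
for aid, s in [("a", 0.3), ("b", 0.9), ("c", 0.5), ("d", 0.1)]:
    topk.offer(aid, s)
print(sorted(topk.heap, reverse=True))    # [(0.9, 'b'), (0.5, 'c')]
```
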
V Locking Protocol for Materialized Aggregate Join Views on B-Tree Indices

Immediate materialized view maintenance with transactional consistency is highly desirable to support real-time decision making. Nevertheless, due to high deadlock rates, such maintenance can cause significant performance degradation in the database system. To increase concurrency during such maintenance, we previously proposed the V locking protocol for materialized aggregate join views and showed how to implement it on hash indices. In this paper, we address the thorny problem of implementing the V locking protocol on B-tree indices. We also formally prove that our techniques are both necessary and sufficient to ensure correctness (serializability).

Gang Luo
Web Information Credibility

The World Wide Web is the biggest repository of information and knowledge, and such information gives people a framework for organizing their private and professional lives. Research aimed at evaluating the credibility of Web content has recently become increasingly crucial because the Web has started to influence our daily lives. The abundance of content on the Web, the lack of publishing barriers, and poor quality control of Web content raise credibility issues. If users are not aware of the credibility of Web information, they can be easily misled, which can sometimes be dangerous.

For example, some researchers report that there are more than twenty thousand health-related sites on the Web, but more than half of them have not been reviewed by medical specialists. Wikipedia has become increasingly popular on the Web, but its risks from the viewpoint of credibility have also been pointed out. There are many exaggerated advertisements and fake images and movies, so image forensic research is also becoming important.

Many dimensions of information credibility can be grouped into two key components: expertise and trustworthiness. Expertise concerns the writer's ability to produce correct or fair information and the degree to which the reader can perceive knowledge and skill in the information; it is characterized by terms such as knowledgeable, experienced and competent. Trustworthiness concerns readers' perception that the information is true as far as they know, and the degree to which readers can perceive the goodness or morality of the target information; it is characterized by terms such as well-intentioned, unbiased and reputable. In the areas of Web search and mining, however, most conventional research has focused on ranking search results by popularity through link-structure analysis, or on mining useful rules from the Web; it has not focused on analyzing the credibility of the target information.

Consequently, few users perform rigorous evaluations of the credibility of the information they obtain. Therefore, exploring a general framework and automatic tools for supporting users in judging web content credibility is becoming increasingly necessary.

In this talk, we describe a new framework and methods for evaluating the Web information credibility. These include: a bipartite-graph framework for evaluating the credibility of relations, and several methods for analyzing Web information credibility from the viewpoint of (1) content analysis, (2) social support analysis and (3) author analysis.

Katsumi Tanaka
Backmatter
Metadata
Title
Web-Age Information Management
edited by
Lei Chen
Changjie Tang
Jun Yang
Yunjun Gao
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-14246-8
Print ISBN
978-3-642-14245-1
DOI
https://doi.org/10.1007/978-3-642-14246-8
