
2011 | Book

Database Systems for Advanced Applications

16th International Conference, DASFAA 2011, Hong Kong, China, April 22-25, 2011, Proceedings, Part II

Editors: Jeffrey Xu Yu, Myoung Ho Kim, Rainer Unland

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This two-volume set, LNCS 6587 and LNCS 6588, constitutes the refereed proceedings of the 16th International Conference on Database Systems for Advanced Applications, DASFAA 2011, held in Hong Kong, China, in April 2011. The 53 revised full papers and 12 revised short papers, presented together with 2 invited keynote papers, 22 demonstration papers, 4 industrial papers, 8 demo papers, and the abstract of 1 panel discussion, were carefully reviewed and selected from a total of 225 submissions. The topics covered are social networks, social networks and privacy, data mining, probability and uncertainty, stream processing, graphs, XML, XML and graphs, similarity, searching and digital preservation, spatial queries, query processing, and indexing and high performance.

Table of Contents

Frontmatter

Similarity

Efficient Histogram-Based Similarity Search in Ultra-High Dimensional Space

Recent developments in image content analysis have shown that the dimensionality of an image feature can reach thousands or more for satisfactory results in some applications such as face recognition. Although high-dimensional indexing has been extensively studied in the database literature, most existing methods are tested on feature spaces with at most a few hundred dimensions, and their performance degrades quickly as dimensionality increases. Given the huge popularity of histogram features in representing image content, in this paper we propose a novel indexing structure for efficient histogram-based similarity search in ultra-high dimensional space which is also sparse. Observing that all possible histogram values in a domain form a finite set of discrete states, we leverage the time and space efficiency of the inverted file. Our new structure, named the two-tier inverted file, indexes the data space on two levels, where the first level represents the list of occurring states for each individual dimension, and the second level represents the list of occurring images for each state. In the query process, candidates can be quickly identified with a simple weighted state-voting scheme before their actual distances to the query are computed. To further enrich the discriminative power of the inverted file, an effective state expansion method is also introduced that takes neighboring dimensions' information into consideration. Our extensive experimental results on real-life face datasets with 15,488-dimensional histogram features demonstrate the high accuracy and the great performance improvement of our proposal over existing methods.
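As a rough illustration of the two-tier inverted file and the weighted state-voting step, the sketch below assumes histogram values are already quantized to integer states; all names are illustrative, not the authors' implementation:

```python
from collections import defaultdict

def build_two_tier_index(images):
    """First tier: per dimension, the occurring states.
    Second tier: per (dimension, state), the list of image ids."""
    index = defaultdict(lambda: defaultdict(list))
    for img_id, hist in enumerate(images):
        for dim, state in enumerate(hist):
            if state != 0:                      # sparse data: skip empty bins
                index[dim][state].append(img_id)
    return index

def candidates_by_state_voting(index, query, min_votes=1):
    """An image gets one vote per dimension in which it shares the query's
    state; images with fewer than min_votes votes are pruned before any
    actual distance computation."""
    votes = defaultdict(int)
    for dim, state in enumerate(query):
        if state != 0 and state in index[dim]:
            for img_id in index[dim][state]:
                votes[img_id] += 1
    return {i for i, v in votes.items() if v >= min_votes}
```

Only the surviving candidates would then be compared to the query by their actual histogram distance.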

Jiajun Liu, Zi Huang, Heng Tao Shen, Xiaofang Zhou
A Retrieval Strategy Using the Integrated Knowledge of Similarity and Associations

Retrieval is often considered the most important task in Case-Based Reasoning (CBR), since it lays the foundation for overall performance of CBR systems. In CBR, a typical retrieval strategy is realized through similarity knowledge encoded in similarity measures. This strategy is often called similarity-based retrieval (SBR). This paper proposes and validates that association analysis techniques can be used to improve SBR. We propose a retrieval strategy USIMSCAR that performs the retrieval task by integrating similarity and association knowledge. We show its reliability, in comparison with several retrieval methods implementing SBR, using datasets from UCI ML Repository.

Yong-Bin Kang, Shonali Krishnaswamy, Arkady Zaslavsky
PG-Skip: Proximity Graph Based Clustering of Long Strings

String data is omnipresent and appears in a wide range of applications. Often string data must be partitioned into clusters of similar strings, for example, for cleansing noisy data. A promising string clustering approach is the recently proposed Graph Proximity Cleansing (GPC). A distinguishing feature of GPC is that it automatically detects the cluster borders without knowledge about the underlying data, using the so-called proximity graph. Unfortunately, the computation of the proximity graph is expensive. In particular, the runtime is high for long strings, thus limiting the application of the state-of-the-art GPC algorithm to short strings.

In this work we present two algorithms, PG-Skip and PG-Binary, that efficiently compute the GPC cluster borders and scale to long strings. PG-Skip follows a prefix pruning strategy and does not need to compute the full proximity graph to detect the cluster border. PG-Skip is much faster than the state-of-the-art algorithm, especially for long strings, and computes the exact GPC borders. We show the optimality of PG-Skip among all prefix pruning algorithms. PG-Binary is an efficient approximation algorithm, which uses a binary search strategy to detect the cluster border. Our extensive experiments on synthetic and real-world data confirm the scalability of PG-Skip and show that PG-Binary approximates the GPC clusters very effectively.

Michail Kazimianec, Nikolaus Augsten
An Effective Approach for Searching Closest Sentence Translations from the Web

There are large numbers of well-translated sentence pairs on the Web, which can be used for translating sentences in different languages. It is an interesting problem to search the closest sentence translations from the Web for high-quality translation, which has attracted significant attention recently. However, it is not straightforward to develop an effective approach, as this task heavily depends on the effectiveness of the similarity model which is used to quantify the similarity between two sentences. In this paper, we propose several optimization techniques to address this problem. We devise a phrase-based model to quantify the similarity between two sentences. We judiciously select high-quality phrases from sentences, which can capture the key features of sentences and thus can be used to quantify similarity between sentences. Experimental results show that our approach has performance advantages compared with the state-of-the-art sentence matching methods.

Ju Fan, Guoliang Li, Lizhu Zhou

Searching and Digital Preservation

Finding the Sites with Best Accessibilities to Amenities

Finding the most accessible locations has a number of applications. For example, a user may want to find an accommodation that is close to different amenities such as schools, supermarkets, and hospitals. In this paper, we study the problem of finding the most accessible locations among a set of possible sites. The task is converted to a top-k query that returns the k points from a set of sites R with the best accessibilities. Two R-tree based algorithms are proposed to answer the query efficiently. Experimental results show that our proposed algorithms are several times faster than a baseline algorithm on large-scale real datasets under a wide range of parameter settings.
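Reading "best accessibility" as smallest aggregate distance to the nearest amenity of each type (an assumption; the abstract does not fix the scoring function), a brute-force baseline for the top-k query looks like:

```python
import heapq
import math

def accessibility(site, amenities_by_type):
    """Sum, over amenity types, of the Euclidean distance from the site
    to the nearest amenity of that type (one plausible scoring; the
    paper's exact definition may differ)."""
    return sum(min(math.dist(site, a) for a in ams)
               for ams in amenities_by_type.values())

def top_k_accessible(sites, amenities_by_type, k):
    # Smaller aggregate distance = better accessibility.
    return heapq.nsmallest(k, sites,
                           key=lambda s: accessibility(s, amenities_by_type))
```

The paper's R-tree algorithms avoid scoring every site exhaustively, which this baseline does.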

Qianlu Lin, Chuan Xiao, Muhammad Aamir Cheema, Wei Wang
Audio Lifelog Search System Using a Topic Model for Reducing Recognition Errors

A system that records daily conversations is one of the most useful types of lifelogs. It is, however, not widely used due to the low precision of speech recognizers when applied to conversations. To solve this problem, we propose a method that uses a topic model to reduce incorrectly recognized words. Specifically, we measure the relevancy between a term and the other words in the conversation and remove those terms that fall below a threshold. An audio lifelog search system was implemented using the method. Experiments showed that our method is effective in compensating for recognition errors of speech recognizers, and we observed an increase in both precision and recall. The results indicate that our method can reduce errors in the index of a lifelog search system.

Taro Tezuka, Akira Maeda
Towards Web Search by Sentence Queries: Asking the Web for Query Substitutions

In this paper, we propose a method to search the Web for sentence substitutions for a given sentence query. Our method uses only lexico-syntactic patterns dynamically generated from the input sentence query to collect sentence substitutions from the Web on demand. Experimental results show that our method works well and can be used to obtain sentence substitutions for rare sentence queries as well as for popular ones. It is also shown that our method can collect various types of sentence substitutions, such as paraphrases, generalized sentences, detailed sentences, and comparative sentences. Our method searches for sentence substitutions whose expressions appear most frequently on the Web. Therefore, even if users issue a sentence query for which Web search engines return no or few results for some reason, our method enables users to collect more Web pages about the given sentence query or about sentences related to the query.

Yusuke Yamamoto, Katsumi Tanaka
The DISTARNET Approach to Reliable Autonomic Long-Term Digital Preservation

The rapidly growing production of digital data, together with their increasing importance and essential demands for their longevity, urgently require systems that provide reliable long-term preservation of digital objects. Most importantly, these systems have to ensure guaranteed availability over a long period of time, as well as integrity and authenticity of the preserved data and their metadata. This means that all kinds of technical problems need to be reliably handled and that the evolution of data formats is supported. At the same time, systems need to scale with the volume of data to be archived. In this paper, we present DISTARNET, a fully distributed system that reliably executes pre-defined workflows for long-term preservation. Moreover, DISTARNET is designed as an open system that allows the curators of digital objects to specify new processes to cope with additional challenges.

Ivan Subotic, Heiko Schuldt, Lukas Rosenthaler

Spatial Queries

A Unified Algorithm for Continuous Monitoring of Spatial Queries

Continuous monitoring of spatial queries has gained significant research attention in the past few years. Although numerous algorithms have been proposed to solve specific queries, there does not exist a unified algorithm that solves a broad class of spatial queries. In this paper, we first define a versatile top-k query and show that various important spatial queries can be modeled as versatile top-k queries by defining a suitable scoring function. Then, we propose an efficient algorithm to continuously monitor versatile top-k queries. To show the effectiveness of our proposed approach, we model various inherently different spatial queries as versatile top-k queries and conduct experiments to show the efficiency of our unified algorithm. The extensive experimental results demonstrate that our unified algorithm is several times faster than the existing best known algorithms for monitoring constrained k nearest neighbors queries, furthest k neighbors queries, and aggregate k nearest neighbors queries.

Mahady Hasan, Muhammad Aamir Cheema, Xuemin Lin, Wenjie Zhang
Real-Time Monitoring of Moving Objects Using Frequently Used Routes

Recently, real-time monitoring of moving objects (MOs), such as cars and humans, has attracted interest. In these systems, tracking the trajectories and positions of MOs accurately and at the least communication cost is an important goal. To achieve this, predicting MO positions is important. Some studies have proposed real-time monitoring systems that share route information between MOs and a server to predict MO positions; however, these systems target public transportation, such as buses, for which the route is always fixed. To the best of the authors' knowledge, a real-time monitoring method that shares route information between passenger cars and a central server does not exist. This paper proposes such a method based on sharing frequently used route (FUR) data between each MO and the server. The server and the MO predict the near-future position of the MO on the basis of the FUR information using a common algorithm. When the position of the MO deviates from the prediction, the MO sends a message to the server to maintain position synchronization. This paper evaluates the proposed method using real trajectories of cars and shows that it outperforms the conventional method, that is, dead reckoning on a road network.

Yutaka Ohsawa, Kazuhisa Fujino, Htoo Htoo, Aye Thida Hlaing, Noboru Sonehara
wNeighbors: A Method for Finding k Nearest Neighbors in Weighted Regions

As a fundamental operation in Location Based Services (LBS), the k nearest neighbors query has received dramatic attention. In this paper, for the first time, we study how to monitor the weighted k nearest neighbors (WkNN) in a novel weighted space to reflect more complex scenarios. Different from traditional kNN approaches, distances are measured according to a weighted Euclidean metric: the length of a path is defined to be the sum of its weighted subpaths, where each subpath is weighted by the regions it passes through. By dividing the plane into a set of Combination Regions, a data structure called the "Weighted Indexing Map" (WIM) is presented; the WIM keeps an index of the weighted length information. Based on the WIM, we propose an efficient algorithm, called wNeighbors, for answering the WkNN query. The experimental results show that our WIM-based WkNN processing algorithm is effective and efficient.
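The weighted path-length definition above translates directly into code; representing a path as a sequence of straight subpaths annotated with the weight of the region each crosses is an illustrative assumption:

```python
import math

def weighted_path_length(segments):
    """A path is a sequence of (start, end, region_weight) subpaths; its
    weighted length is the sum of each subpath's Euclidean length scaled
    by the weight of the region it passes through."""
    return sum(w * math.dist(p, q) for p, q, w in segments)
```

Under this metric, a geometrically shorter path through a heavy region can be "longer" than a detour through light regions, which is what distinguishes WkNN from ordinary kNN.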

Chuanwen Li, Yu Gu, Ge Yu, Fangfang Li
Aggregate Farthest-Neighbor Queries over Spatial Data

In this paper, we study a new type of spatial query, namely aggregate k farthest neighbor (AkFN) search. Given a data point set P and a query point set Q, an AkFN query returns the k points in P with the largest aggregate distances to all points in Q. For instance, it is reasonable to build a new hotel where the aggregate distance to all existing hotels is maximized, to reduce competition. Our investigation of AkFN queries focuses on three aggregate functions, namely Sum, Max, and Min. Assuming that the data set is indexed by an R-tree, we propose two algorithms, namely minimum bounding (MB) and best first (BF), for efficiently solving AkFN queries with all three aggregate functions. The BF algorithm is incremental and IO-optimal. Extensive experiments on both synthetic and real data sets confirm the efficiency and effectiveness of our proposed algorithms.
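A brute-force baseline for the AkFN query follows directly from the definition (the paper's MB and BF algorithms instead prune an R-tree; this sketch only illustrates the query semantics):

```python
import heapq
import math

AGGS = {"sum": sum, "max": max, "min": min}

def akfn(P, Q, k, agg="sum"):
    """Aggregate k-farthest-neighbor: the k points of P whose aggregate
    distance to all query points in Q is largest."""
    f = AGGS[agg]
    return heapq.nlargest(k, P, key=lambda p: f(math.dist(p, q) for q in Q))
```

This scores every point in P, so it reads the whole data set; the R-tree based algorithms avoid exactly that.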

Yuan Gao, Lidan Shou, Ke Chen, Gang Chen

Query Processing I

Querying Business Process Models Based on Semantics

In recent years, business process management technology has come into wider use, and the number of business process models (graphs) keeps growing. Managing such a large number of business process models is challenging, and querying business process models is a basic function for doing so. For example, based on business process model queries, a model designer can find related models and evolve them instead of starting from scratch, which saves a lot of time and is less error-prone. To this end, we propose a language (BQL) for users to express their requirements based on semantics. For efficiency, we adopt an efficient method to compute the semantic features of business process models and use indexes to support query processing. To make our approach more applicable, we consider the semantic similarity between labels. The approach proposed in this paper is implemented in our system BeehiveZ. Analysis and experiments show that our approach works well.

Tao Jin, Jianmin Wang, Lijie Wen
Discovering Implicit Categorical Semantics for Schema Matching

Attribute-level schema matching is a critical step in numerous database applications, such as dataspaces, ontology merging, and schema integration. Much research exists on this topic; however, it ignores the implicit categorical information which is crucial to finding high-quality matches between schema attributes. In this paper, we discover the categorical semantics implicit in source instances and associate them with the matches in order to improve the overall quality of schema matching. Our method works in three phases. The first phase is a pre-detecting step that detects the possible categories of source instances by using clustering techniques. In the second phase, we employ information entropy to find the attributes whose instances imply the categorical semantics. In the third phase, we introduce a new concept, the c-mapping, to represent the associations between the matches and the categorical semantics. Then, we employ an adaptive scoring function to evaluate the c-mappings so as to associate the matches with the semantics. Moreover, we show how to translate the matches with semantics into schema mapping expressions and use the chase procedure to transform source data into target schemas. An experimental study shows that our approach is effective and has good performance.

Guohui Ding, Guoren Wang
Expressive Power of Query Languages for Constraint Complex Value Databases

Motivated by constraint complex values which allow us to represent nested finitely representable sets, we study the expressive power of various query languages over constraint complex value databases. The tools we use come in the form of collapse results which are well established results in the context of first-order logic. We show that active-generic collapse carries over to second-order logic for structures with o-minimality and any relational signature in the complex value model. We also consider the problem of safety in the context of embedded finite complex value models and constraint complex value databases.

Hong-Cheu Liu
Scaling Up Query Allocation in the Presence of Autonomous Participants

In large-scale, heterogeneous information systems, mediators are widely used for query processing, and the good operation of a system strongly depends on the way the mediator allocates queries. On the other hand, it is well known that a single mediator is a potential scalability and performance bottleneck as well as a single point of failure. Thus, multiple mediators should perform the query allocation process. This task is challenging in large-scale systems because participants typically have special interests that are not performance-related. Mediators should satisfy participants' interests as if there were a single mediator in the system, i.e., with no, or almost no, additional network traffic. In this paper, we propose a virtual money-based query allocation method, called VMbQA, to perform query allocation in the presence of multiple mediators and autonomous participants. A key feature of VMbQA is that it allows a system to scale up to several mediators with no additional network cost. The results show that VMbQA significantly outperforms baseline methods from both the satisfaction and performance points of view.

Jorge-Arnulfo Quiané-Ruiz, Philippe Lamarre, Sylvie Cazalens, Patrick Valduriez
Generating Preview Instances for the Face Validation of Entity-Relationship Schemata: The Acyclic Case

We describe a mapping of Extended Entity-Relationship schemata to Answer Set Programming that allows us to generate informative example instances of the relational database that is implied by the conceptual schema. Such instances may be submitted to people involved in the requirements phase as a glimpse of what instances are allowed by the E-R schema at hand, thus enhancing database comprehension and so-called face validation.

Maria Amalfi, Alessandro Artale, Andrea Calì, Alessandro Provetti

Query Processing II

Dynamic Skylines Considering Range Queries

Dynamic skyline queries are practical in many applications. For example, if no data exist to fully satisfy a query q in an information system, the data "closer" to the requirements of q can be retrieved as answers. Finding the nearest neighbors of q can be one solution; finding the data not dynamically dominated by any other data with respect to q, i.e. the dynamic skyline regarding q, can be another. A data point p is defined to dynamically dominate another data point s with respect to q if, in each dimension, the distance between p and q is no larger than the corresponding distance between s and q, and in at least one dimension the distance between p and q is smaller than that between s and q. Some approaches for answering dynamic skyline queries have been proposed. However, the existing approaches only consider the query as a point rather than a range in each dimension, although range queries are also frequently issued by users. In this paper we make the first attempt to solve the problem of computing dynamic skylines for range queries. To deal with this problem, we propose an efficient algorithm based on a grid index and a novel variant of the well-known Z-order curve. Moreover, a series of experiments are performed to evaluate the proposed algorithm, and the results demonstrate that it is effective and efficient.
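The dynamic dominance definition above translates directly into code; the following brute-force sketch (an illustration of the semantics, not the paper's grid/Z-order algorithm) computes the dynamic skyline with respect to a point query q:

```python
def dynamically_dominates(p, s, q):
    """p dynamically dominates s w.r.t. query point q: p's per-dimension
    distance to q is never larger than s's, and strictly smaller in at
    least one dimension."""
    dp = [abs(pi - qi) for pi, qi in zip(p, q)]
    ds = [abs(si - qi) for si, qi in zip(s, q)]
    return (all(a <= b for a, b in zip(dp, ds)) and
            any(a < b for a, b in zip(dp, ds)))

def dynamic_skyline(points, q):
    """Points not dynamically dominated by any other point w.r.t. q."""
    return [p for p in points
            if not any(dynamically_dominates(o, p, q)
                       for o in points if o != p)]
```

The paper's contribution is handling q as a range per dimension rather than the single point assumed here.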

Wen-Chi Wang, En Tzu Wang, Arbee L. P. Chen
EcoTop: An Economic Model for Dynamic Processing of Top-k Queries in Mobile-P2P Networks

This work addresses the processing of top-k queries in mobile ad hoc peer-to-peer (M-P2P) networks using economic schemes. Our proposed economic model, designated EcoTop, issues economic rewards to mobile peers that send relevant data items (i.e., those that contribute to the top-k query result) and penalizes peers for sending irrelevant items, thereby incentivizing the optimization of communication traffic. The main contributions of our work are three-fold. First, we propose the EcoTop economic model for efficient top-k query processing in M-P2P networks. Second, we propose two schemes, namely ETK and ETK+, for assigning rewards/penalties to peers and for enabling peers to re-evaluate the scores of their data items for item re-ranking purposes. Third, we conduct a performance study, which demonstrates that EcoTop is indeed effective in improving the performance of top-k queries while minimizing communication traffic. Notably, our economic incentive model also discourages free-riding in M-P2P networks.

Nilesh Padhariya, Anirban Mondal, Vikram Goyal, Roshan Shankar, Sanjay Kumar Madria
REQUEST: Region-Based Query Processing in Sensor Networks

In wireless sensor networks, node failures occur frequently. The effects of these failures can be reduced by using aggregated values of small groups of sensors instead of the values of individual sensors. However, most existing work has focused on computing the aggregation of all the nodes without grouping. Only a few approaches have proposed the processing of grouped aggregate queries; however, since the groups in these approaches are disjoint, some regions relevant to the query can be missed. In this paper, we propose REQUEST, region-based query processing in sensor networks. A region in our approach is defined as a maximal set of nodes located within a circle having a diameter specified in the query. To efficiently construct a large number of regions covering the entire monitoring field, we build the SEC (Smallest Enclosing Circle) index. Moreover, in order to process a region-based query, we adapt a hierarchical aggregation method in which there is a leader for each region. To reduce the communication cost, we formulate an optimal leader selection problem and transform it into the set-cover problem. Finally, we devise a query-initiated routing tree for the communication between the leader and non-leader nodes in a region. Our experimental results show that the result of a region-based query is more reliable than that of a query based on individual nodes, and that our processing method is more energy-efficient than existing methods for processing grouped aggregate queries.

Dong-Wan Choi, Chin-Wan Chung
Efficient Distributed Top-k Query Processing with Caching

Recently, there has been increased interest in incorporating rank-aware query operators, such as top-k queries, into database management systems, allowing users to retrieve only the most interesting data objects. In this paper, we propose a cache-based approach for efficiently supporting top-k queries in distributed database management systems. In large distributed systems, query performance depends mainly on the network cost, measured as the number of tuples transmitted over the network. Ideally, only the k tuples that belong to the query result set should be transmitted. Nevertheless, a server cannot decide based only on its local data which tuples belong to the result set. Therefore, in this paper, we use caching of previous results to reduce the number of tuples that must be fetched over the network. To this end, our approach always delivers as many tuples as possible from the cache and constructs a remainder query to fetch the remaining tuples. This is different from existing distributed approaches, which need to re-execute the entire top-k query when the cached entries are not sufficient to provide the result set. We demonstrate the feasibility and efficiency of our approach through an implementation in a distributed database management system.
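The cache-plus-remainder idea can be sketched as follows, assuming the cached result is a correctly ranked prefix of the full answer (a simplification; the paper must also handle cache consistency across servers). All names are illustrative:

```python
def answer_top_k(k, cached, fetch_remainder):
    """Serve as many of the top-k tuples as possible from the cache and
    issue a remainder query only for the tuples the cache cannot cover,
    instead of re-executing the whole top-k query."""
    served = cached[:k]
    missing = k - len(served)
    if missing > 0:
        # remainder query: fetch only tuples ranked after the cached ones
        served += fetch_remainder(missing,
                                  after=served[-1] if served else None)
    return served
```

Here `fetch_remainder` stands in for the network round-trip; the point is that it transfers at most `k - len(cached)` tuples.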

Norvald H. Ryeng, Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg
Exploiting Correlation to Rank Database Query Results

In recent years, effective ranking strategies for relational databases have been extensively studied. Existing approaches have adopted empirical term-weighting strategies, namely tf × idf (term frequency times inverse document frequency) schemes, from the field of information retrieval (IR) without careful consideration of the relational model. This paper proposes a novel ranking scheme that exploits statistical correlations, which represent the underlying semantics of the relational model. We extend Bayesian network models to provide a dependence structure in relational databases. Furthermore, a limited assumption of value independence is defined to relax the unrealistic execution cost of the probabilistic model. Experimental results show that our model is competitive in terms of efficiency without losing the quality of query results.

Jaehui Park, Sang-goo Lee

Indexing and High Performance

LinearDB: A Relational Approach to Make Data Warehouse Scale Like MapReduce

Operating on computer clusters, parallel databases enjoy enhanced performance. However, the scalability of a parallel database is limited by a number of factors. Although MapReduce-based systems are highly scalable, their performance is not satisfactory for data-intensive applications. In this paper, we explore the feasibility of building a data warehouse that incorporates the best features of both technologies: the efficiency of parallel databases and the scalability and fault tolerance of MapReduce. Toward this goal, we design a prototype system called LinearDB. LinearDB organizes data in a decomposed snowflake schema and adopts three operations, transform, reduce, and merge, to accomplish query processing. All these techniques are specially designed for the cluster environment. Our experimental results show that its scalability matches MapReduce and its performance is up to 3 times as good as that of PostgreSQL.

Huiju Wang, Xiongpai Qin, Yansong Zhang, Shan Wang, Zhanwei Wang
Genetic Algorithm Based QoS-Aware Service Compositions in Cloud Computing

Services in cloud computing can be categorized into two groups: application services and utility computing services. Compositions at the application level are similar to Web service compositions in SOC (Service-Oriented Computing). Compositions at the utility level are similar to task matching and scheduling in grid computing. The contributions of this paper include: 1) an extensible QoS model for calculating the QoS values of services in cloud computing; 2) a genetic-algorithm-based approach for composing services in cloud computing; and 3) a comparison between the proposed approach and other algorithms, i.e., exhaustive search and random selection.

Zhen Ye, Xiaofang Zhou, Athman Bouguettaya
Energy-Efficient Tree-Based Indexing Schemes for Information Retrieval in Wireless Data Broadcast

Mobile computing can be equipped with wireless devices to allow users retrieving information from anywhere at anytime. Recently, wireless data broadcast becomes a popular data dissemination method, especially for broadcasting public information to a large number of mobile subscribers at the same time.

Access Latency

and

Tuning Time

are two main criteria to evaluate the performance of a data broadcast system. Indexing can dramatically reduce tuning time by guiding clients to turn into doze mode while waiting for the data to arrive.

B

 + 

-Tree Distributed Index Scheme (BTD)

is a popular index scheme for wireless data broadcast, which has been extended by many research works. Among traditional index structures, alphabetic Huffman tree is another important tree-based index technique with the advantage of taking data’s access frequency into consideration. In this paper, we are trying to answer one question: can alphabetic Huffman tree replace B

 + 

-tree and provide better performance? To answer this question, we propose a novel

Huffman-Tree based Distributed Index Scheme (HTD)

and compare its performance with

BTD

based on a uniform communication environment. The performances of

HTD

and

BTD

are analyzed both theoretically and empirically. With the analysis result, we conclude that

HTD

outperforms

BTD

and can replace

BTD

in most existing wireless data broadcast system.

Jiaofei Zhong, Weili Wu, Yan Shi, Xiaofeng Gao
Buffer Cache De-duplication for Query Dispatch in Replicated Databases

We propose a buffer cache de-duplication technique for query dispatch in replicated databases. Replicated databases suffer from the well-known "buffer cache duplication" problem, in which different buffer caches hold some identical data. Unfortunately, existing de-duplication approaches have shortcomings: the SQL statements of queries alone (e.g., FROM and WHERE clauses) are insufficient to estimate exactly which data the queries reference, as required for duplication-free dispatch. Our approach uses index access patterns to construct a look-up table that allows a dispatcher to determine to which database it should dispatch a query. We implement a prototype and demonstrate that, under a certain condition, around 90% duplication is reduced to 12% with two databases, and the referenced data on each buffer cache is cut down to approximately 40% with eight databases. Finally, we discuss whether this condition applies to actual workloads.

Takeshi Yamamuro, Yoshiharu Suga, Naoya Kotani, Toshio Hitaka, Masashi Yamamuro
Indexing for Vector Projections

The ability to extract the most relevant information from a dataset is paramount when the dataset is large. For data arising from a numeric domain, a pervasive means of modelling the data is to represent it in the form of vectors. This enables a range of geometric techniques; this paper introduces projection as a natural and powerful means of scoring the relevancy of vectors. As yet, there are no effective indexing techniques for quickly retrieving those vectors in a dataset that have large projections onto a query vector. We address that gap by introducing the first indexing algorithms for vectors of arbitrary dimension, producing indices with strong sub-linear and output-sensitive worst-case query cost and linear data structure size guarantees in the I/O cost model. We improve this query cost markedly for the special case of two dimensions. These algorithms derive from the novel geometric insight presented in this paper: the concept of a data vector's cap.

Sean Chester, Alex Thomo, S. Venkatesh, Sue Whitesides
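The projection-based relevancy score underlying this abstract can be illustrated directly. The sketch below scores each data vector by the length of its projection onto a query vector via a linear scan; the paper's contribution is precisely an index that avoids scanning everything, which is not shown here:

```python
import numpy as np

# Score each data vector v by its projection length onto query q:
# score(v) = (v . q) / ||q||. Larger scores mean v extends further
# in the query direction.

def projection_scores(data, q):
    q = np.asarray(q, dtype=float)
    return np.asarray(data, dtype=float) @ q / np.linalg.norm(q)

data = [[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
scores = projection_scores(data, [1.0, 0.0])
top = int(np.argmax(scores))  # the vector longest along q scores highest
assert top == 0
```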

Industrial Papers

Assessment of Cardiovascular Disease Risk Prediction Models: Evaluation Methods

This paper uses real-world anaesthesia time-series monitoring data to predict cardiovascular disease risk in a manner similar to exercise electrocardiography. Models derived from the entire anaesthesia population and from subgroups based on pre-anaesthesia likelihood of complications are compared, in an attempt to ascertain which model performance measures are best suited to populations with differing pre-test probability of disease. Misclassification rate (MR) as a measure of prediction-model performance was compared with the Kappa statistic, sensitivity, specificity, positive and negative predictive values, and area under the receiver operating characteristic curve (AUC). In this medical application of data mining, MR is shown to depend on the prevalence of disease within the population studied, whereas AUC and the Kappa statistic are shown to be independent of disease prevalence.

Richi Nayak, Ellen Pitt
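The two measures contrasted in this abstract can be computed on toy data to make the distinction concrete. This is a generic illustration with made-up scores, not the paper's evaluation; AUC is computed here via the Mann-Whitney U statistic (the fraction of positive/negative pairs ranked correctly), which is threshold-free, while MR depends on a threshold and on class prevalence:

```python
# Misclassification rate (threshold-dependent) versus AUC (ranking-based).

def misclassification_rate(scores, labels, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    # Each correctly ordered positive/negative pair counts 1, ties count 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 1, 0, 0]
assert auc(scores, labels) == 1.0                      # perfectly ranked
assert misclassification_rate(scores, labels) == 0.0   # no errors at 0.5
```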
Visual Analysis of Implicit Social Networks for Suspicious Behavior Detection

In this paper we show how social networks, implicitly built from communication data, can serve as a basis for detecting suspicious behavior in large communication datasets (landline and mobile phone calls) provided by communication service providers to criminal investigators under two procedures: lawful interception and data retention. We propose the following contributions: (i) a data model and a set of operators for querying this data to extract suspicious behavior, and (ii) a user-friendly, easy-to-navigate visual representation of communication data, with a prototype implementation.

Amyn Bennamane, Hakim Hacid, Arnaud Ansiaux, Alain Cagnati
Compositional Information Extraction Methodology from Medical Reports

Currently the health care industry is undergoing a huge expansion in different respects, and advances in Clinical Informatics (CI) are an important part of this expansion. One goal of CI is to apply information technology to better patient care through two major applications, namely electronic health care data management and information extraction from medical documents. In this paper we focus on the second application. For better management and fruitful use of information, it is necessary to contextually segregate the important and relevant information buried in a huge corpus of unstructured texts. Hence Information Extraction (IE) from unstructured texts becomes a key technology in CI, dealing with sub-topics such as extraction of biomedical entities and relations, passage/paragraph-level information extraction, ontological study of diseases and treatments, summarization, and topic identification. Though the literature is promising for individual IE tasks, an integrated approach for contextually relevant IE from medical documents is not readily apparent. To this end, we propose a compositional approach that integrates contextually (domain-specifically) constructed IE modules to improve knowledge support for patient care. The input to this composite system is free-format medical case reports containing stage-wise information corresponding to the evolution of a patient care activity. The output is a compilation of various types of extracted information organized under tags such as past medical history, signs/symptoms, tests and test results, diseases, treatment, and follow-up. The outcome aims to help health care professionals explore a large corpus of medical case studies and select only the relevant component-level information according to their needs and interests.

Pratibha Rani, Raghunath Reddy, Devika Mathur, Subhadip Bandyopadhyay, Arijit Laha
A Framework for Semantic Recommendations in Situational Applications

Information overload is an increasingly important concern as users access and generate steadily growing amounts of data. Besides, enterprise applications tend to grow more and more complex which hinders their usability and impacts business users’ productivity. Personalization and recommender systems can help address these issues, by predicting items of interest for a given user and enabling a better selection of the proposed information. Recommendations have become increasingly popular in web environments, with sites like Amazon, Netflix or Google News. However, little has been done so far to leverage recommendations in corporate settings. This paper presents our approach to integrate recommender systems in enterprise environments, taking into account their specific constraints. We present an extensible framework enabling heterogeneous recommendations, based on a semantic model of users’ situations and interactions. We illustrate this framework with a system suggesting structured queries and visualizations related to an unstructured document.

Raphaël Thollot, Marie-Aude Aufaure

Demo Papers

Storage and Use of Provenance Information for Relational Database Queries

In database querying, provenance information can help users understand where data comes from and how it is derived. Storing provenance data is challenging in that the storage cost should be as small as possible while remaining fine-grained, and the store should also support efficient user queries for provenance tracking. In this demo, we present a relational database system prototype that supports SQL-like queries while recording provenance data during query execution. In particular, we propose a tree structure for storing provenance information, together with various reduction strategies that optimize its storage cost, and we support tuple-level provenance tracking for user queries in a visual way.

Zhifeng Bao, Henning Koehler, Xiaofang Zhou, Tok Wang Ling
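The tree-structured, tuple-level provenance described above can be sketched in a few lines. This is a generic illustration of the idea (each output tuple carrying a small tree of the operators and source tuples that produced it), not the prototype's actual storage format or reduction strategies:

```python
# Sketch of tuple-level provenance: each output tuple carries a tree whose
# internal nodes are operators and whose leaves are base tuple identifiers.

class Provenance:
    def __init__(self, op, children):
        self.op = op              # operator that produced the tuple, e.g. "join"
        self.children = children  # base tuple ids or sub-provenance trees

    def sources(self):
        """Flatten the tree to the set of contributing base tuple identifiers."""
        out = set()
        for c in self.children:
            out |= c.sources() if isinstance(c, Provenance) else {c}
        return out

# An output tuple derived by joining base tuples R:1 and S:7, then projecting.
p = Provenance("project", [Provenance("join", ["R:1", "S:7"])])
assert p.sources() == {"R:1", "S:7"}
```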
MRQSim: A Moving Range Query Simulation Platform in Spatial Networks

In this paper, we present our system MRQSim, which efficiently evaluates continuous range queries in a simulated road-network environment as the query object moves. We propose two query types and consider complex simulation setups. The system architecture and a demonstration are illustrated. Because incremental maintenance methods based on valid intervals are adopted and the potential uncertainty of objects is considered, MRQSim achieves high efficiency and adapts well to complex running scenarios.

Yu Gu, Na Guo, Chuanwen Li, Ge Yu
DWOBS: Data Warehouse Design from Ontology-Based Sources

In past decades, data warehouse (DW) applications were built from traditional data sources. The availability of domain ontologies creates an opportunity for sources to make their semantics explicit and to exploit them in many applications, including data warehousing. In this paper, we present DWOBS, a CASE tool for ontology-based DW design. It takes as input (i) a set of sources referencing an OWL ontology and (ii) a set of decisional requirements formulated in SPARQL syntax over that ontology. DWOBS produces a semantic multidimensional model of the DW to be designed.

Selma Khouri, Ladjel Bellatreche
AUCWeb: A Prototype for Analyzing User-Created Web Data

In this demonstration, we present a prototype system, called AUCWeb, that is designed for analyzing user-created web data. It has two novel features: 1) it can utilize external resources for semantic annotation of low-quality user-created content, and 2) it provides a descriptive language for defining analytical tasks. Both the internal mechanism of AUCWeb and its usage for building advanced applications are shown in the demonstration.

Weining Qian, Feng Chen, Juan Du, Weiming Zhang, Can Zhang, Haixin Ma, Peng Cai, Minqi Zhou, Aoying Zhou
Blending OLAP Processing with Real-Time Data Streams

CEP and databases share some characteristics but have traditionally been treated as two separate worlds, the former oriented towards real-time event processing and the latter towards long-term data management. However, many real-time data and business intelligence analyses need to confront streaming data with long-term stored data. For instance, how do current power consumption and power grid distribution compare with last year's for the same day?

StreamNetFlux is a novel system, recently brought to market, designed to integrate CEP and database functionality. This blending allows queries that need both capabilities to be processed with top efficiency and scalability.

João Costa, José Cecílio, Pedro Martins, Pedro Furtado
AutoBayesian: Developing Bayesian Networks Based on Text Mining

A Bayesian network is a widely used tool for data analysis, modeling, and decision support in various domains. There is a growing need for techniques and tools that can automatically construct Bayesian networks from massive text or literature data. In practice, Bayesian networks also need to be updated when new data is observed, and literature mining is a very important source of new data after the initial network is constructed. Information closely related to a Bayesian network usually includes causal associations, statistical information, and experimental results. However, these associations and numerical results cannot be directly integrated into the Bayesian network: the source of the literature and the perceived quality of the research need to be factored into the integration process. In this demo, we present a general methodology and toolkit called AutoBayesian, which we developed to automatically build and update a Bayesian network based on causal relationships derived from text mining.

Sandeep Raghuram, Yuni Xia, Jiaqi Ge, Mathew Palakal, Josette Jones, Dave Pecenka, Eric Tinsley, Jean Bandos, Jerry Geesaman
Classify Uncertain Data with Decision Tree

This demo presents a decision-tree-based classification system for uncertain data. The decision tree is a commonly used data classification technique, and tree learning algorithms can generate decision tree models from a training data set. When working on uncertain or probabilistic data, the learning and prediction algorithms need to handle the uncertainty cautiously, or else the decision tree could be unreliable and prediction results may be wrong. In this demo, we present DTU, a new decision-tree-based classification and prediction system for uncertain data. This tool uses new measures for constructing, pruning, and optimizing the decision tree. These new measures are computed over the probability distribution functions of the uncertain data. Based on the new measures, the optimal splitting attributes and splitting values can be identified and used in the decision tree. We will show in this demo that DTU can process various types of uncertainty and has satisfactory classification performance even when data is highly uncertain.

Biao Qin, Yuni Xia, Rakesh Sathyesh, Jiaqi Ge, Sunil Probhakar
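One general idea behind uncertainty-aware tree learning can be sketched as follows: when each record carries a class probability rather than a hard label, split quality can be measured by entropy over expected (fractional) class counts. This is a generic illustration of that idea under made-up data, not DTU's actual measures:

```python
import math

def entropy(p_pos_list):
    """Entropy of a partition whose records carry probabilistic positive labels."""
    if not p_pos_list:
        return 0.0
    p = sum(p_pos_list) / len(p_pos_list)  # expected fraction of positives
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def split_entropy(records, threshold):
    """Weighted entropy after splitting on a numeric attribute at `threshold`."""
    left = [p for x, p in records if x <= threshold]
    right = [p for x, p in records if x > threshold]
    n = len(records)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# records: (attribute value, probability that the record is positive)
records = [(1.0, 0.9), (2.0, 0.8), (8.0, 0.1), (9.0, 0.2)]

# Splitting at 5.0 separates likely-positives from likely-negatives,
# so its weighted entropy is lower than the uninformative split at 1.5.
assert split_entropy(records, 5.0) < split_entropy(records, 1.5)
```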
StreamFitter: A Real Time Linear Regression Analysis System for Continuous Data Streams

In this demo, we present the StreamFitter system for real-time linear regression analysis on continuous data streams. To perform regression on data streams, the regression model must be updated continuously as new data arrive. We will present two approaches for online, multi-dimensional linear regression analysis of stream data, namely Incremental Mathematical Stream Regression (IMSR) and Approximate Stream Regression (ASR). These methods dynamically recompute the regression model, considering not only the data records of the current window but also a synopsis of the previous data; the refined parameters therefore model the entire data stream more accurately. The demo will show that the proposed methods are not only efficient in time and space but also generate better-fitted regression functions than traditional sliding-window methods, and that they adapt well to data changes.

Chandima Hewa Nadungodage, Yuni Xia, Fang Li, Jaehwan John Lee, Jiaqi Ge
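The core mechanism that makes streaming regression feasible can be sketched with sufficient statistics: instead of refitting on a window, maintain the accumulated matrices X^T X and X^T y as records arrive, so the fitted model summarizes the entire stream. This is a generic illustration of that idea, not the paper's IMSR or ASR methods:

```python
import numpy as np

class IncrementalRegression:
    """Online multi-dimensional least squares via accumulated X^T X and X^T y."""

    def __init__(self, d):
        self.xtx = np.zeros((d + 1, d + 1))  # +1 for the intercept term
        self.xty = np.zeros(d + 1)

    def update(self, x, y):
        # Append a constant 1 so the model learns an intercept.
        x1 = np.append(np.asarray(x, dtype=float), 1.0)
        self.xtx += np.outer(x1, x1)
        self.xty += y * x1

    def coefficients(self):
        # Least-squares solution from the accumulated statistics.
        return np.linalg.lstsq(self.xtx, self.xty, rcond=None)[0]

model = IncrementalRegression(d=1)
for x in [0.0, 1.0, 2.0, 3.0]:
    model.update([x], 2.0 * x + 1.0)  # stream drawn from y = 2x + 1
slope, intercept = model.coefficients()
assert abs(slope - 2.0) < 1e-9 and abs(intercept - 1.0) < 1e-9
```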

Panel

Challenges in Managing and Mining Large, Heterogeneous Data

Success in various application domains, including sensor networks, social networks, and multimedia, has ushered in a new era of information explosion. Despite the diversity of these domains, the data acquired by their applications are often voluminous, heterogeneous, and laden with uncertainty. They share several common characteristics, which impose new challenges on storing, integrating, and processing these data, especially in the context of data outsourcing and cloud computing.

Some challenges include the following. First, autonomous data acquisition gives rise to privacy and security issues. Therefore, data management and mining must be elastic and privacy-conscious. Second, data is often dynamic and the trend in the data is often unpredictable. This calls for efficient incremental or cumulative algorithms for data management and mining. Load balancing and other real-time technologies are also indispensable for the task. Third, data repositories are distributed. Thus, gathering, coordinating, and integrating heterogeneous data in data management and mining will face unprecedented challenges.

This panel session gives researchers of different background and expertise an opportunity to address these challenging issues together. The main topics of this panel session target the themes in the interdisciplinary domains spreading across database, web, wireless data management, social networking, multimedia, and data outsourcing.

Haibo Hu, Haixun Wang, Baihua Zheng

Tutorials

Managing Social Image Tags: Methods and Applications

Social tags describe images from many aspects, including the visual content observable in the images, the context and usage of the images, and user opinions. We present an overview of a tutorial on social image tag management: an approach to managing the tags associated with social images on the Web. Social image tag management defines a set of techniques for measuring the effectiveness of tags in describing their annotated resources, for discovering relationships between tags, and for applying such knowledge in various tag-based social media management applications such as tag recommendation, tag disambiguation, and tag-based browsing systems. This tutorial offers an introduction to these issues and a synopsis of the state of the art.

Aixin Sun, Sourav S. Bhowmick
Web Search and Browse Log Mining: Challenges, Methods, and Applications

Huge amounts of search log data have accumulated in various search engines. Currently, a commercial search engine receives billions of queries and collects terabytes of log data on any single day. Besides search log data, browse logs can be collected by client-side browser plug-ins, which record browse information when users' permission is granted. Such massive amounts of search/browse log data, on the one hand, provide great opportunities to mine the wisdom of crowds and improve search results as well as online advertising. On the other hand, designing effective and efficient methods to clean, model, and process large-scale log data also presents great challenges.

In this tutorial, I will focus on mining search and browse log data for search engines. I will start with an introduction to search and browse log data and an overview of data summarizations frequently used in log mining. I will then elaborate on how log mining applications enhance the five major components of a search engine, namely query understanding, document understanding, query-document matching, user understanding, and monitoring and feedback. For each aspect, I will survey the major tasks, fundamental principles, and state-of-the-art methods. Finally, I will discuss the challenges and future trends of log data mining.

Daxin Jiang
Searching, Analyzing and Exploring Databases

Keyword-based search, analysis, and exploration enable users to easily access databases without needing to learn a structured query language or to study possibly complex data schemas. Supporting keyword-based search, analysis, and exploration over databases has become an emerging hot area in database research and development due to its substantial benefit. Researchers from different disciplines are working together to tackle the various challenges in this area.

This tutorial aims to outline the problem space of supporting keyword-based search, analysis, and exploration over databases, to introduce representative and state-of-the-art techniques that address different aspects of the problem, and to discuss further challenges and potential future research directions. It will provide researchers and developers with a systematic and organized view of the techniques related to this topic.

A tutorial on a similar topic was given at SIGMOD 2009 [1] and was very well received. Since research interest in keyword search on structured data is ever increasing and plenty of new techniques have appeared since then, this tutorial has been updated to incorporate the new findings in this area, covering query processing, type-ahead search, query suggestion, personalization, result comparison, faceted search, etc.

Yi Chen, Wei Wang, Ziyang Liu
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Editors
Jeffrey Xu Yu
Myoung Ho Kim
Rainer Unland
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-20152-3
Print ISBN
978-3-642-20151-6
DOI
https://doi.org/10.1007/978-3-642-20152-3