
2010 | Book

Database Systems for Advanced Applications

15th International Conference, DASFAA 2010, International Workshops: GDM, BenchmarX, MCIS, SNSMW, DIEW, UDM, Tsukuba, Japan, April 1-4, 2010, Revised Selected Papers

Edited by: Masatoshi Yoshikawa, Xiaofeng Meng, Takayuki Yumoto, Qiang Ma, Lifeng Sun, Chiemi Watanabe

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

1st International Workshop on Graph Data Management: Techniques and Applications (GDM 2010)

GDM2010 Workshop Organizers’ Message

The graph is a powerful tool for representing and understanding objects and their relationships in various application domains. Recently, graphs have been widely used to model many complex structured and schemaless data sources such as the semantic web, social networks, biological networks, chemical compounds, multimedia databases, and business process models. The growing popularity of graph databases has generated interesting and fundamental data management problems that have attracted a lot of attention from the database community, such as subgraph search queries, supergraph search queries, frequent subgraph mining, and approximate subgraph matching. In principle, efficient management of large graph databases is a key performance issue in any graph-based application.

Sherif Sakr, Wei Wang
On-Line Preferential Nearest Neighbor Browsing in Large Attributed Graphs

Given a large weighted directed graph where nodes are associated with attributes and edges are weighted, we study a new problem, called preferential nearest neighbors (NN) browsing, in this paper. In such browsing, a user may provide one or more source nodes and some keywords to retrieve the nearest neighbors of those source nodes that contain the given keywords. For example, when a tourist has a plan to visit several places (source nodes), he/she would like to search hotels with some preferred features (e.g., Internet and swimming pools). It is highly desirable to recommend a list of nearby hotels with those preferred features, in order of road network distance to the places (source nodes) the tourist wants to visit. The existing approach by graph traversal at query time requires long query processing time, and the approach by maintenance of pre-computed all-pairs shortest distances requires huge storage space on disk. In this paper, we propose new approaches to support on-line preferential NN browsing. The data graphs we deal with are weighted directed graphs where nodes are associated with attributes, and the distances between nodes to be found are the exact distances in the graph. We focus on two-step approaches. In the first step, we identify a number of reference nodes (also called centers) which lie on some shortest paths between a source node and a preferential NN node that contains the user-given keywords. In the second step, we find the preferential NN nodes within a certain distance to the source nodes via the relevant reference nodes, using an index that supports both textual (attribute) and distance conditions. Our approach tightly integrates NN search with preference search, which is confirmed to be efficient and effective in finding preferential NN nodes.
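The traversal-time baseline the abstract contrasts against can be made concrete with a short sketch: a Dijkstra-style expansion from a source node that reports keyword-matching nodes in increasing distance order. This is only an illustration; the graph, attribute sets, and function names below are hypothetical and not the authors' implementation.

```python
import heapq

def preferential_nn_browse(adj, attrs, source, keywords, k=5):
    """Baseline preferential-NN browsing by graph traversal (Dijkstra):
    expand the weighted directed graph from `source` and report the k
    nearest nodes whose attribute set contains all given keywords.
    `adj` maps node -> list of (neighbor, edge_weight); `attrs` maps
    node -> set of keyword strings."""
    dist, heap, results = {source: 0.0}, [(0.0, source)], []
    while heap and len(results) < k:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        if u != source and keywords <= attrs.get(u, set()):
            results.append((u, d))  # nodes pop in increasing distance order
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return results

# Example: hotels offering both "internet" and "pool" near a source spot.
adj = {"s": [("h1", 2.0), ("h2", 1.0)], "h2": [("h3", 1.5)]}
attrs = {"h1": {"internet", "pool"}, "h3": {"internet", "pool"}}
print(preferential_nn_browse(adj, attrs, "s", {"internet", "pool"}, k=2))
# [('h1', 2.0), ('h3', 2.5)]
```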

Jiefeng Cheng, Jeffrey Xu Yu, Reynold C. K. Cheng
Mining Useful Time Graph Patterns on Extensively Discussed Topics on the Web
(Position Paper)

Temporal characteristics of the web have been analyzed widely in recent years, while graph patterns have served important roles for analyzing the web's structural characteristics. Because temporal characteristics have not been incorporated into previous patterns, we specifically examine a novel kind of pattern, time graph patterns, which incorporate time-series data including the creation times of pages and links. We find useful time graph patterns representing the process by which a topic is discussed extensively during a short period, without manual investigation of web graphs. We have also analyzed the patterns and the web pages corresponding to them; three characteristic pages are contained in the patterns. Additionally, we have succeeded in finding a subgraph matching a mined pattern and observed that the subgraph corresponds to an extensively discussed topic.

Taihei Oshino, Yasuhito Asano, Masatoshi Yoshikawa
Querying Graph-Based Repositories of Business Process Models

With the rapid increase in the number of process models developed by different process designers, it becomes crucial for business process designers to be able to search the repository for models that could handle a similar situation before developing new ones. In this paper, we present an approach for querying repositories of graph-based business process models. Our approach is based on a visual query language for business processes called BPMN-Q. BPMN-Q is used to query business process models by matching the structure of a graph query to that of a process model. The query engine of our system is built on top of a traditional RDBMS. We make use of the robust relational indexing infrastructure in order to achieve efficient and scalable query evaluation.

Ahmed Awad, Sherif Sakr
SGDB – Simple Graph Database Optimized for Activation Spreading Computation

In this paper, we present SGDB, a graph database with a storage model optimized for the computation of Spreading Activation (SA) queries. The primary goal of the system is to minimize the execution time of the spreading activation algorithm over large graph structures stored on persistent media, without pre-loading the whole graph into memory. We propose a storage model aiming to minimize the number of accesses to the storage media during execution of SA, and we propose a graph query type for the spreading activation operation. Finally, we present the implementation and its performance characteristics in the scope of our pilot application, which uses spreading activation over the Wikipedia link graph.
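For readers unfamiliar with the query type SGDB optimizes, here is a minimal, storage-agnostic sketch of spreading activation over an in-memory adjacency map. The decay factor, threshold, and fan-out normalization are illustrative assumptions, not SGDB's actual algorithm or storage model.

```python
def spread_activation(adj, seeds, decay=0.8, threshold=0.05, max_iters=3):
    """Minimal spreading-activation sketch: activation flows from seed
    nodes along edges, attenuated by `decay` and split over out-edges,
    until it drops below `threshold` or `max_iters` hops are reached.
    `adj` maps node -> list of neighbors."""
    activation = dict(seeds)   # node -> accumulated activation
    frontier = dict(seeds)     # activation added in the last hop
    for _ in range(max_iters):
        nxt = {}
        for u, a in frontier.items():
            out = adj.get(u, [])
            if not out:
                continue
            share = a * decay / len(out)  # fan-out normalization
            if share < threshold:
                continue
            for v in out:
                nxt[v] = nxt.get(v, 0.0) + share
        for v, a in nxt.items():
            activation[v] = activation.get(v, 0.0) + a
        frontier = nxt
    return activation

adj = {"Tokyo": ["Japan", "Kyoto"], "Japan": ["Asia"], "Kyoto": ["Japan"]}
print(spread_activation(adj, {"Tokyo": 1.0}))
```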

Marek Ciglan, Kjetil Nørvåg

Data Intensive e-Science Workshop (DIEW2010)

Introduction to the Data Intensive e-Science Workshop (DIEW) 2010

As the amount and the complexity of scientific data are rapidly increasing, data has become a key aspect of scientific research. Creating the computer infrastructure that enables scientists to extract scientific knowledge by linking, processing, and analyzing these distributed and diverse data is a crucial step towards the fourth paradigm, Data-Intensive Scientific Discovery, proposed by the late Jim Gray. This e-Science infrastructure is also a basis for constructing digital repositories which can archive and share valuable knowledge among science communities.

Isao Kojima, Kento Aida, Geoffrey Fox, Neil Chue Hong, Masatoshi Ohishi
WISE-CAPS: Web-Based Interactive Secure Environment for Collaborative Analysis of Planetary Science

We are developing a Web-GIS-based collaboration environment for lunar and planetary science. This system, called WISE-CAPS, aims to promote collaboration and data sharing among researchers over the network. In WISE-CAPS, all data are stored on the server, and data access is controlled with the server's security modules and control files. The system combines an easy-to-use user environment with flexible and robust security.

Junya Terazono, Ryosuke Nakamura, Shinsuke Kodama, Naotaka Yamamoto, Hirohide Demura, Naru Hirata, Yoshiko Ogawa, Jun’ichi Haruyama, Makiko Ohtake, Tsuneo Matsunaga
Towards Large-Scale Scientific Dataspaces for e-Science Applications

This work intends to provide a large-scale scientific data management solution based on the concepts of dataspaces for e-Science applications. Our approach is to semantically enrich the existing relationships among primary and derived data items, and to preserve both relationships and data together within a dataspace to be reused by owners and others. To enable reuse, data must be well preserved, and preservation of scientific data can best be established if the full life cycle of data is addressed. This challenge is addressed by the e-Science life cycle ontology, whose major goal is to trace semantics about procedures in scientific experiments. jSpace, a first prototype of a scientific dataspace support platform, has been implemented and deployed to an early core of adopters in the breath gas research domain, from which specific use cases are derived. In this paper we describe the architecture, discuss a specific prototype implementation, and outline the design concepts of a second prototype.

Ibrahim Elsayed, Peter Brezany
Providing Constructed Buildings Information by ASTER Satellite DEM Images and Web Contents

The evolution of computer technologies has made it easy to accumulate and deliver scientific data. The GEO Grid project has collected global satellite images from 2000 to the present, and the collection amounts to about 150 TB. It is necessary to generate new value by integrating satellite images with heterogeneous information such as Web contents or geographical data. Using GEO Grid satellite images, some studies detect feature changes such as earthquakes, fires, and newly constructed buildings. In this paper, detections of feature changes from time-series satellite images are referred to as events, and we focus on events concerning newly constructed buildings. Usually, there are articles about such newly constructed buildings on the Web: a newly opened shopping center is typically introduced in a news report, and a newly constructed apartment is often discussed by neighboring residents. We therefore propose an event detection system that extracts candidate events from satellite images, collects information about them from the Web, and integrates the two. The system consists of an event detection module and a Web contents collection module. The event detection module detects geographical regions whose elevation values differ between two temporally different satellite images. The regions are translated from latitude/longitude coordinates to building names by an inverse geocoder. The contents collection module then collects Web pages by querying the building names against a search engine, and the collected pages are re-ranked based on temporal proximity to the event occurrence time. We developed a prototype system, and the evaluation showed that it detected information on building construction events with appropriate web contents in Tsukuba, Japan.

Takashi Takagi, Hideyuki Kawashima, Toshiyuki Amagasa, Hiroyuki Kitagawa
Gfdnavi, Web-Based Data and Knowledge Server Software for Geophysical Fluid Sciences, Part I: Rationales, Stand-Alone Features, and Supporting Knowledge Documentation Linked to Data

In recent years, many data centers and research groups have provided data on geophysical fluids, such as the atmosphere and oceans, through the Internet along with on-line visualization. However, their services are not available once data files are downloaded. This paper presents open-source software named Gfdnavi, developed to overcome this limitation and to support data handling beyond initial "quick-looks". Gfdnavi extracts metadata from scientific data and stores it in a database, where it can be accessed with web browsers for search, analysis, and visualization. It supports a wide range of usage, such as public data services, group data management, and desktop use. As a unique feature, Gfdnavi supports writing and archiving documents based on knowledge obtained through data analysis. The documents are linked to the original data and the analysis/visualization procedures. This has a wide variety of applications, such as support for interdisciplinary and collaborative studies, a realization of falsifiability, and educational use.

Takeshi Horinouchi, Seiya Nishizawa, Chiemi Watanabe, Akinori Tomobayashi, Shigenori Otsuka, Tsuyoshi Koshiro, Yoshi-Yuki Hayashi, GFD Dennou Club
Gfdnavi, Web-Based Data and Knowledge Server Software for Geophysical Fluid Sciences, Part II: RESTful Web Services and Object-Oriented Programming Interface

In recent years, increasing amounts of scientific data on geophysical and environmental fluids, e.g., in the atmosphere and oceans, have become available, and there is increasing demand for web-based data services. Several browser-based data servers, on which geophysical-fluid data can be analyzed and visualized, have been developed. However, they are suitable only for initial "quick-looks" and not for subsequent research processes. As a solution, we developed data server software named Gfdnavi. One of its important features is extensive support for programming (scripting): users can easily switch between operations in a web browser and operations using scripts or command lines. This paper describes its network features: its web services, which are an important part of Gfdnavi's programmability, and its functionality for searching across multiple Gfdnavi servers. To develop the web services, we adopted the REST architecture. We also developed a client library providing access to the web services from the programming language Ruby. Using this library, data can be analyzed and visualized on either the server side or the client side, and data on multiple servers can be handled. Search across multiple servers is made possible by a simple peer-to-peer network with a central server, with the peer-to-peer communication based on web services.

Seiya Nishizawa, Takeshi Horinouchi, Chiemi Watanabe, Yuka Isamoto, Akinori Tomobayashi, Shigenori Otsuka, GFD Dennou Club

3rd International Workshop on Managing Data Quality in Collaborative Information Systems (MCIS2010)

MCIS2010 Workshop Organizers’ Message

In today’s global information sharing environments, poor data quality is known to compromise the credibility and efficiency of commercial as well as public endeavours. Several developments from industry as well as academia have contributed significantly towards addressing the problem.

Shazia Sadiq, Xiaochun Yang, Xiaofang Zhou, Ke Deng
Checking Structural Integrity for Metadata Repository Systems by Means of Description Logics

The organization of metadata in repository systems exhibits a complex structure that is layered, multi-level, and dynamically adaptable. In addition, existing repository system standards insufficiently specify how to ensure structural integrity. These two factors frequently lead to violations of structural integrity during the creation of metadata structures based on the Meta Object Facility (MOF), and thus affect the stability of repository systems. However, structural integrity checking for MOF-based repository systems is difficult because MOF is rendered to users as graphs, which lack precise semantics. In this paper, we address this problem by means of Description Logics (DLs). The approach is based on a particular formal logic of the DL family. We study how to formalize the different levels of the MOF architecture into a DL knowledge base and how to check inconsistencies automatically using the query and reasoning mechanisms provided by the Description Logic. A performance evaluation of a prototype structural integrity checking system implemented with this approach yields encouraging results.

Xiaofei Zhao, Zhiqiu Huang
On Memory and I/O Efficient Duplication Detection for Multiple Self-clean Data Sources

In this paper, we propose efficient algorithms for duplicate detection across multiple data sources that are themselves duplicate-free. In developing these algorithms, we give full consideration to the various possible cases arising from the workload of data sources to be cleaned and the available memory. The algorithms are memory- and I/O-efficient, reducing the number of pair-wise record comparisons and minimizing the total page access cost involved in the cleaning process. Experimental evaluation demonstrates that the proposed algorithms are efficient and achieve better performance than SNM and random access methods.
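The abstract compares against SNM, the classical Sorted Neighborhood Method. A minimal sketch of that baseline follows; the blocking key and window size are assumptions for illustration.

```python
def sorted_neighborhood(records, key, window=3):
    """Sorted Neighborhood Method (SNM) sketch: sort records by a
    blocking key, then compare each record only with its neighbors
    inside a sliding window instead of with all other records."""
    ordered = sorted(records, key=key)
    candidate_pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            candidate_pairs.append((rec, ordered[j]))
    return candidate_pairs

records = [{"name": "Jon Smith"}, {"name": "John Smith"}, {"name": "Ann Lee"}]
# Block on the first three characters of the name (an illustrative key).
pairs = sorted_neighborhood(records, key=lambda r: r["name"][:3], window=2)
print(pairs)
```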

Ji Zhang, Yanfeng Shu, Hua Wang
Top-K Generation of Mediated Schemas over Multiple Data Sources

Schema integration has been widely used in many database applications, such as data warehousing, life sciences, and ontology merging. Though schema integration has been intensively studied in recent years, it remains a challenging issue, because it is almost impossible to find the perfect target schema. This paper proposes an automatic method for schema integration that explores multiple possible integrated schemas over a set of source schemas from the same domain. Firstly, the concept graph is introduced to represent the source schemas at a higher level of abstraction. Secondly, we divide the similarity between concepts into intervals to generate three merging strategies for schemas. Finally, we design a novel top-k ranking algorithm for the automatic generation of the best candidate mediated schemas. The key component of our algorithm is the pruning technique, which uses an ordered buffer and a threshold to filter out candidates. Extensive experimental studies show that our algorithm is effective and runs in polynomial time.
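A compact illustration of threshold-based top-k pruning with an ordered buffer, in the spirit of (but not identical to) the paper's ranking algorithm; the scoring function here is a stand-in.

```python
import heapq

def top_k(candidates, score, k=3):
    """Threshold-style top-k sketch: keep a bounded min-heap (the
    ordered buffer) of the k best candidates seen so far; a new
    candidate is pruned unless it beats the current k-th best score,
    which sits at the heap root and acts as the threshold."""
    heap = []  # (score, index, candidate); the index breaks score ties
    for i, c in enumerate(candidates):
        s = score(c)
        if len(heap) < k:
            heapq.heappush(heap, (s, i, c))
        elif s > heap[0][0]:              # beats the current threshold
            heapq.heapreplace(heap, (s, i, c))
    return sorted(heap, reverse=True)

print(top_k(["ab", "abcd", "a", "abc"], score=len, k=2))
# [(4, 1, 'abcd'), (3, 3, 'abc')]
```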

Guohui Ding, Guoren Wang, Bin Wang
A Graphical Method for Reference Reconciliation

In many applications, several references may refer to one real-world entity; the task of reference reconciliation is to group those references into clusters so that each cluster is associated with only one real entity. In this paper we propose a new method for reference reconciliation: in addition to the traditional attribute-value similarity, we employ record-level relationships to compute association similarity values between references in graphs. We then combine this similarity with the traditional attribute-value similarity and use a clustering algorithm to group the closest references.

Zheng Yongqing, Kong Qing, Dong Guoqing

2nd International Workshop on Benchmarking of Database Management Systems and Data-Oriented Web Technologies (BenchmarX’10)

BenchmarX’10 Workshop Organizers’ Message

The 2nd International Workshop on Benchmarking of Database Management Systems and Data-Oriented Web Technologies (BenchmarX'10) was held on April 4, 2010 at the University of Tsukuba, Japan in conjunction with the 15th International Conference on Database Systems for Advanced Applications (DASFAA'10). It was organized by Irena Mlýnková, Martin Nečaský and Jiří Dokulil from the Department of Software Engineering of the Charles University in Prague, Czech Republic.

Irena Mlýnková, Martin Nečaský, Jiří Dokulil
Benchmarking Holistic Approaches to XML Tree Pattern Query Processing
(Extended Abstract of Invited Talk)

In this talk I outlined and surveyed developments in the field of XML tree pattern query processing, focusing especially on holistic approaches. XML tree pattern query (TPQ) processing is a research stream within XML data management that focuses on efficient TPQ answering. With the increasing popularity of XML for data representation, there is much interest in query processing over data that conforms to a tree-structured data model. Queries on XML data are commonly expressed in the form of tree patterns (or twig patterns), which represent a very useful subset of XPath and XQuery. Efficiently finding all tree pattern matches in an XML database is a major concern of XML query processing, and in the past few years many algorithms have been proposed to match such tree patterns. The talk presented an overview of the state of the art in TPQ processing, starting with background on holistic approaches, then introducing different algorithms, and finally presenting benchmark datasets and experiments.

Jiaheng Lu
Benchmarking the Compression of XML Node Streams

In recent years, many approaches to XML twig pattern query processing have been developed. Holistic approaches are particularly significant in that they provide a theoretical model for optimal processing of some query classes and have very low main memory complexity. Holistic algorithms are supported by a stream abstract data type (ADT), usually implemented using inverted lists or special-purpose data structures. In this article, we focus on an efficient implementation of a stream ADT. We utilize previously proposed fast decoding algorithms for prefix variable-length codes such as Elias-delta, Fibonacci of order 2 and 3, and Elias-Fibonacci codes. We compare the efficiency of access to a stream under various decompression algorithms, and against data structures where no compression is used. We show that compression improves the efficiency of XML query processing.
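Elias-delta, one of the codes benchmarked here, can be sketched in a few lines. This toy encoder/decoder works on Python strings of '0'/'1' characters rather than the packed bit streams and table-driven fast decoders a real implementation would use.

```python
def elias_delta_encode(n):
    """Elias-delta code of a positive integer: the bit length of n is
    written in Elias-gamma, followed by n without its leading 1-bit."""
    b = bin(n)[2:]
    length = bin(len(b))[2:]
    gamma = "0" * (len(length) - 1) + length   # Elias-gamma of len(b)
    return gamma + b[1:]

def elias_delta_decode(bits, pos=0):
    """Decode one Elias-delta codeword starting at `pos`;
    return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    n_len = int(bits[pos + zeros:pos + 2 * zeros + 1], 2)  # gamma part
    pos += 2 * zeros + 1
    value = int("1" + bits[pos:pos + n_len - 1], 2)        # re-attach leading 1
    return value, pos + n_len - 1

stream = "".join(elias_delta_encode(x) for x in (5, 1, 17))
pos, out = 0, []
while pos < len(stream):
    v, pos = elias_delta_decode(stream, pos)
    out.append(v)
print(out)  # [5, 1, 17]
```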

Radim Bača, Jiří Walder, Martin Pawlas, Michal Krátký
Generation of Synthetic XML for Evaluation of Hybrid XML Systems

Hybrid XML storage offers a large number of alternative shredding choices. In order to determine optimal shredding strategies automatically, it is crucial to understand how the structure of an XML data set affects performance. Since the structure can take many forms and the number of possible mappings is huge, it is important to gain insight into the relation between structure and performance for formats that are actually used. By taking real-world data sets and modifying their structure in steps, one can see how performance and other measurable properties change. We describe how a data generator can be used to produce a synthetic data set based on an existing data set, using four different models. We compare the performance on the original data set with the performance on the different synthetic models.

David Hall, Lena Strömbäck
Benchmarking Publish/Subscribe-Based Messaging Systems

Publish/subscribe-based messaging systems are used increasingly often as a communication mechanism in data-oriented web applications. Such applications often pose serious performance and scalability challenges. To address these challenges, it is important that systems are tested using benchmarks to evaluate their performance and scalability before they are put into production. In this paper, we present jms2009-PS, a new benchmark for publish/subscribe-based messaging systems built on top of the SPECjms2007 standard workload. We introduce the benchmark and discuss its configuration parameters, showing how the workload can be customized to evaluate various aspects of publish/subscribe communication. Finally, we present a case study illustrating how the benchmark can be used for performance analysis of messaging servers.

Kai Sachs, Stefan Appel, Samuel Kounev, Alejandro Buchmann
An Experimental Evaluation of Relational RDF Storage and Querying Techniques

The Resource Description Framework (RDF) is a flexible model for representing information about resources in the web. With the increasing amount of RDF data becoming available, efficient and scalable management of RDF data has become a fundamental challenge for achieving the Semantic Web vision. The RDF model has attracted a lot of attention from the database community, and many researchers have proposed different solutions to store and query RDF data efficiently. In this paper, we evaluate the state of the art of approaches that rely on the relational infrastructure to provide scalable engines for storing and querying RDF data. Our experimental evaluation is done on top of the recently introduced SP²Bench performance benchmark for RDF query engines. The results of our experiments show that there is still room for optimization in the proposed generic relational RDF storage schemes, and thus new techniques for storing and querying RDF data are still required to bring forward the Semantic Web vision.
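The generic relational scheme under evaluation, a single "triple table" queried through self-joins, can be illustrated with SQLite. The schema and sample data below are illustrative, not the paper's actual setup.

```python
import sqlite3

# Minimal sketch of the generic triple-table RDF scheme: every RDF
# statement is one (subject, predicate, object) row, and pattern
# queries become self-joins over that table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("paper1", "dc:creator", "Alice"),
    ("paper1", "dcterms:issued", "2010"),
    ("paper2", "dc:creator", "Alice"),
])

# "Papers written by Alice, with their year if known": one self-join.
rows = conn.execute("""
    SELECT t1.s, t2.o
    FROM triples t1 LEFT JOIN triples t2
      ON t1.s = t2.s AND t2.p = 'dcterms:issued'
    WHERE t1.p = 'dc:creator' AND t1.o = 'Alice'
""").fetchall()
print(rows)  # [('paper1', '2010'), ('paper2', None)]
```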

Hooran MahmoudiNasab, Sherif Sakr
Analyzer: A Framework for File Analysis

This paper introduces Analyzer, a complete framework for performing statistical analyses of real-world documents. Exploiting the results of such analyses is a classical way to optimize data processing in many areas. Although this intent is legitimate, ad hoc and dedicated analyses soon become obsolete; they are usually built on insufficiently extensive collections and are difficult to repeat. Analyzer is an easily extensible framework that helps the user gather documents, manage analyses, and browse computed reports. This paper discusses the proposed analysis model, standard application usage and features, and the basic aspects of the Analyzer architecture and implementation.

Martin Svoboda, Jakub Stárka, Jan Sochna, Jiří Schejbal, Irena Mlýnková

Workshop on Social Networks and Social Media Mining on the Web (SNSMW2010)

SNSMW 2010 Workshop Organizers’ Message

The First International Workshop on Social Networks and Social Media Mining on the Web (SNSMW 2010) was held in Tsukuba, Japan on 4 April, 2010, in conjunction with the DASFAA 2010 conference. The aim of the workshop was to provide an opportunity to present papers on social networks and social media mining.

Yoshinori Hijikata, Guandong Xu
Task-Oriented User Modeling Method and Its Application to Service Navigation on the Web

The value of information accumulated on the Web is enhanced if it is provided to a user who currently faces a problematic situation that the information can solve. The authors have been investigating a task-oriented menu, which enables users to search for mobile internet services not by category but by the user's situation. Construction of the task-oriented menu is based on a user modeling method which supports descriptions of user activities, such as task execution and overcoming obstacles encountered during the task, which in turn represent users' situations and/or needs for certain information. We have built task models of mobile users which cover about 97% of the assumed situations of mobile internet services. We then reorganized "contexts" in the model and designed a menu hierarchy from the viewpoint of the task. We have linked the designed menu to the set of actual mobile internet service sites included in the i-mode service operated by NTT docomo, which consists of 5016 services; among them, 4817 services are properly connected to the menu. This paper introduces a framework for a real-scale task-oriented menu system for mobile service navigation, along with its relations to SNS applications as knowledge resources.

Munehiko Sasajima, Yoshinobu Kitamura, Riichiro Mizoguchi
Tag Disambiguation through Flickr and Wikipedia

Social tagging systems are popular, but because they lack any structure, a common issue is low retrieval quality caused by ambiguous terms. In this paper, an approach for improving retrieval in such systems in the presence of ambiguous terms is presented; it attempts to perform tag disambiguation and, at the same time, provide users with relevant content. The idea is based on a mashup that combines data and functionality of two major Web 2.0 sites, namely Flickr and Wikipedia, and aims at enhancing content retrieval for web users. A case study with the ambiguous notion "Apple" illustrates the value of the proposed approach.

Anastasia Stampouli, Eirini Giannakidou, Athena Vakali
Measuring Attention Intensity to Web Pages Based on Specificity of Social Tags

Social bookmarks are used to find Web pages drawing much attention. However, the tendency of pages to collect bookmarks differs by topic. Therefore, the raw number of bookmarks indicates attention but cannot serve as the metric of attention intensity itself. We define the relative quantity of social bookmarks (RQS) for measuring the attention intensity to a Web page. The RQS is calculated using the number of social bookmarks of related pages, where related pages are found using a similarity measure based on the specificity of social tags. We define two types of specificity: local specificity, the specificity for a single user, and global specificity, the specificity common across a social bookmark service.
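The paper's exact formula is not reproduced in this abstract, but the idea of a relative bookmark quantity can be sketched as a ratio against related pages' average bookmark count. This instantiation is an assumption for illustration only, and the related-page discovery via tag specificity is omitted.

```python
def rqs(bookmarks, page, related):
    """Hedged sketch of a relative-bookmark measure in the spirit of
    RQS: compare a page's bookmark count with the average count of its
    related pages. The paper's actual formula may differ."""
    if not related:
        return float("nan")
    avg_related = sum(bookmarks[r] for r in related) / len(related)
    return bookmarks[page] / avg_related if avg_related else float("inf")

bookmarks = {"p": 120, "r1": 40, "r2": 80}
print(rqs(bookmarks, "p", ["r1", "r2"]))  # 2.0 -> above-average attention
```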

Takayuki Yumoto, Kazutoshi Sumiya
SQL as a Mashup Tool: Design and Implementation of a Web Service Integration Approach Based on the Concept of Extensible Relational Database Management Systems

Recently, Web services based on HTTP and XML technology have become widely used, and application programs or mashups integrating such services have also proliferated. To support the integration process, we have developed a tool that converts user-defined function specifications written in XML into the corresponding user-defined functions loadable into PostgreSQL, an extensible relational database management system. With this conversion layer, multiple Web services and relational databases can be integrated in SQL, so integrated applications or mashups can be built quickly without being bothered by the variety of Web service interfaces defining calling sequences and result data types. Moreover, even if one of the Web service providers that a particular mashup depends on changes its service specification, the application does not need to be modified, as long as the corresponding function's calling interface can be kept intact by changing the specification of the function body. This property, which we refer to as Web service independence, is quite important since Web service interfaces may change without any prior notice, and so some management methodology is needed for mashups.

Yoshihiko Ichikawa, Yuuki Matsui, Minoru Tanaka
Design of Impression Scales for Assessing Impressions of News Articles

This paper focuses on the impressions that people get from reading articles in newspapers. We have already proposed web application systems that extract and use several types of impressions from news articles. However, the types of impressions extracted and used in these systems were intuitively defined by us on the basis of a basic emotion model, which the well-known psychologist Robert Plutchik proposed to represent human emotions. That is, the characteristics of news articles that result in different impressions have not been taken into consideration in much detail. Therefore, we have tried to design one or more impression scales suitable for assessing impressions generated by news articles. First, we conducted nine experiments in each of which 100 people read ten news articles and indicated their impressions on 42 five-point scales, where 42 impression-related words such as “happy” and “strained” were assigned for the 42 scales. Consequently, we obtained impression-estimation data for the 42 impression-related words. Next, we applied factor and cluster analysis to these impression-estimation data, and analyzed similarities among the impression-related words in terms of their scores. In our results, the words that convey similar impressions are classified into a single group and the words that convey opposite impressions are classified into different groups of words. Finally, we designed six impression scales suitable for assessing impressions generated by news articles on the basis of these results, each of which consisted of two contrasting groups of impression-related words.

Tadahiko Kumamoto
An Evaluation Framework for Analytical Methods of Integrating Electronic Word-of-Mouth Information: Position Paper

This paper presents an evaluation framework for analytical methods of integrating eWOM (electronic word-of-mouth) information. The framework involves a communication model that assumes a set of human subjective probabilities called a belief source and includes two translation processes: (1) encoding the belief source into a representation suitable for communication with a computer, where the encoded messages are called eWOM messages, and (2) decoding, in the computer, the eWOM messages to estimate the probabilities in the belief source. The efficiency of reducing the difficulty of describing the belief source and the accuracy of reconstructing the belief source are quantified using this model. The evaluation processes are illustrated with an analytical method that integrates eWOM messages for probabilistic classification problems.

Kazunori Fujimoto
A Framework for Finding Community in Complex Networks

There is an increasing amount of research on complex networks such as the WWW, social networks, and biological networks. One of the hot topics in this area is community detection. Nodes belonging to a community are likely to have common properties; for instance, in the WWW, a community may be a set of pages on the same topic. Community structure is undoubtedly a key characteristic of complex networks. In this paper, we propose a new framework for finding communities in complex networks. This framework uses the idea of an intersection graph together with semantic information, such as texts and attributes, that appears in networks.

Naoki Okada, Kyohei Tanikawa, Yoshinori Hijikata, Shogo Nishida
C&C: An Effective Algorithm for Extracting Web Community Cores

Communities are a significant pattern of the Web. A community is a group of pages related to a common topic. Web communities can be characterized by dense bipartite subgraphs, and each community almost surely contains at least one core, a complete bipartite graph (CBG). Focusing on the problem of extracting such community cores from the Web, in this paper we propose an effective C&C algorithm based on combination and consolidation to extract all embedded cores in web graphs. Experiments on real and large data collections demonstrate that the proposed C&C algorithm is efficient and effective for community core extraction because: 1) all the largest emerging cores can be identified; 2) identifying all the embedded cores of different sizes requires only a one-pass execution of C&C; and 3) the extraction process needs no user-determined parameters.
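C&C itself is not reproduced in this abstract, but the object it extracts, a complete bipartite core, can be illustrated with a small trawling-style enumeration of (i, j)-cores. Note that this is a different and much simpler algorithm than C&C, shown only to make the target structure concrete.

```python
from itertools import combinations
from collections import defaultdict

def bipartite_cores(outlinks, i=2, j=2):
    """Sketch of complete-bipartite-core discovery in a web graph:
    an (i, j)-core is a set of at least i "fan" pages that all link to
    the same j "center" pages. Group fans by each j-subset of their
    outlinks and keep groups with at least i members."""
    groups = defaultdict(set)
    for fan, targets in outlinks.items():
        for centers in combinations(sorted(targets), j):
            groups[centers].add(fan)
    return {c: fans for c, fans in groups.items() if len(fans) >= i}

web = {"a": {"x", "y"}, "b": {"x", "y", "z"}, "c": {"y", "z"}}
print(bipartite_cores(web))
# two (2,2)-cores: centers ('x', 'y') with fans {a, b},
# and centers ('y', 'z') with fans {b, c}
```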

Xianchao Zhang, Yueting Li, Wenxin Liang
Extracting Local Web Communities Using Lexical Similarity

The World Wide Web contains rich textual contents that are interconnected via complex hyperlinks. Most studies on web community extraction focus only on graph structure, so web communities are discovered purely from explicit link information without considering the textual properties of web pages. This paper proposes an improved algorithm based on Flake's method using the maximum flow algorithm. The improved algorithm considers the differences in importance between edges and assigns a well-designed capacity to each edge via the lexical similarity of web pages. Given a specific query, it also lends itself to a new and efficient ranking scheme for members of the extracted community. The experimental results indicate that our approach efficiently handles a variety of data sets through a novel optimization strategy for similarity computation.
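The underlying Flake-style max-flow extraction with similarity-weighted capacities can be sketched with networkx. The virtual source/sink construction and the alpha parameter are standard for this family of methods, while the similarity function and graph below are toy stand-ins, not the paper's design.

```python
import networkx as nx

def extract_community(G, seeds, similarity, alpha=1.0):
    """Flake-style max-flow community sketch: connect a virtual source
    to the seed pages with infinite capacity and every other page to a
    virtual sink with capacity `alpha`, weight each real edge by lexical
    similarity, and return the source side of the minimum cut."""
    H = nx.DiGraph()
    for u, v in G.edges():
        cap = similarity(u, v)           # lexical similarity as capacity
        H.add_edge(u, v, capacity=cap)
        H.add_edge(v, u, capacity=cap)
    for s in seeds:
        H.add_edge("SRC", s, capacity=float("inf"))
    for n in G.nodes():
        if n not in seeds:
            H.add_edge(n, "SINK", capacity=alpha)
    _, (source_side, _) = nx.minimum_cut(H, "SRC", "SINK")
    return source_side - {"SRC"}

G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d")])

def sim(u, v):
    # Toy similarity: pages a, b, c share vocabulary; d does not.
    return 3.0 if {u, v} <= {"a", "b", "c"} else 0.1

print(sorted(extract_community(G, {"a"}, sim)))  # ['a', 'b', 'c']
```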

Xianchao Zhang, Wen Xu, Wenxin Liang
An Improved Algorithm for Extracting Research Communities from Bibliographic Data

In this paper we improve the performance of the community extraction algorithm in [1] for bibliographic data, which was originally proposed for web community discovery by [2]. A web community is considered to be a set of web pages holding a common topic; in other words, it is a dense subgraph induced in the web graph. Such subgraphs obtained by the max-flow algorithm are called max-flow communities, and the algorithm was adapted in [1] to obtain research communities from bibliographic data through a strategy for selecting community nodes. We propose a further improvement of this algorithm by carefully selecting the initial seed node, and we show its performance through experiments on a list of keywords frequently appearing in the data.

Yushi Nakamura, Toshihiko Horiike, Yoshimasa Taira, Hiroshi Sakamoto
Proposal of Deleting Plots from the Reviews to the Items with Stories

Recently, many commercial web sites have allowed users to write reviews. Many people read the reviews of an item they are interested in, and the opinions in reviews are useful when they want to judge whether a certain item is good. However, some reviews of items that contain stories (e.g., novels, movies, and computer games) include spoilers (undesirable descriptions of the story) alongside the opinions of the review authors (reviewers). If users see a review involving spoilers, they might find the item less interesting when they later read or watch it. We aim to build a system that helps users read reviews without encountering plot details.

Kaori Ikeda, Yoshinori Hijikata, Shogo Nishida
Basic Study on a Recommendation Method Considering Region-Restrictedness of Spots

We propose a recommendation method that considers region-restrictedness. In this study, we define a spot as an establishment such as a restaurant, amusement facility, or tourist attraction in the real world. A spot with high region-restrictedness is located in a restricted area rather than in the user's home area. We define a region-restrictedness score to extract region-restricted phrases from text data about spots (such as promotional descriptions). Spots whose descriptions include these phrases are then recommended to the user. In this paper, we present our proposed method and discuss it on the basis of basic experimental results.

Kenta Oku, Fumio Hattori
A Hybrid Recommendation Method with Double SVD Reduction

An issue related to recommendation is the considerable memory required to calculate recommendation scores. We propose a hybrid information recommendation method using singular value decomposition (SVD) to reduce the data size for this calculation. The method combines two steps. First, it reduces the number of documents on the basis of the users' rating pattern by applying SVD based on collaborative filtering (CF). Second, it reduces the number of terms on the basis of the term frequency pattern of the reduced documents by applying SVD based on content-based filtering (CBF). The experimental results show that the proposed method has almost the same mean absolute error (MAE) as the SVD-based CBF. Originally, our data set has 9924 terms; the SVD-based CBF reduces the number of terms to 45, and the proposed method to 15, while preserving the same MAE. This means that the proposed method is effective for calculating recommendations.
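The building block of both reduction steps is a rank-k truncated SVD. A minimal numpy sketch follows, using a toy random matrix rather than the paper's data set.

```python
import numpy as np

def truncated_svd(M, k):
    """Rank-k truncated SVD: the dimensionality-reduction primitive used
    in both steps of the double reduction (first on the user-document
    rating matrix, then on the reduced term-frequency matrix)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
ratings = rng.random((20, 50))           # toy users x documents matrix
U, s, Vt = truncated_svd(ratings, k=5)
approx = U @ np.diag(s) @ Vt             # best rank-5 approximation
print(np.linalg.norm(ratings - approx))  # reconstruction error
```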

Yusuke Ariyoshi, Junzo Kamahara
Monitoring Geo-social Activities through Micro-blogging Sites

Radio Frequency Identification (RFID) comprises uniquely identifiable, inexpensive tags and readers that monitor these tags through radio-frequency signals. The information in the tags is collected with the help of the readers, and the load of a reader is the number of tags it has to monitor. Several algorithms, such as the Load Balancing algorithm and the Redundant Reader Elimination algorithm, have been proposed to improve the effectiveness of RFID systems; the former concentrates on the load of the reader, while the latter concentrates on reducing the power consumption of the RFID system. In Optimal Solution for RFID load balancing, a solution is provided for optimizing the effectiveness of the RFID system with respect to two parameters: reader load and reader power consumption. The maximum number of readers that can be switched off without leaving any reader overloaded is determined, and relocating tags from an overloaded reader to the least loaded reader increases the overall performance of the RFID system for the same power consumption. A mobile agent, the RFID Mobile Agent (RMA), is implemented to collect workload information from the RFID middleware embedded in each reader and to execute the load balancing strategy.

Tatsuya Fujisaka, Ryong Lee, Kazutoshi Sumiya

The 2nd International Workshop on Ubiquitous Data Management (UDM2010)

UDM2010 Workshop Organizers’ Message

Ubiquitous computing technologies provide a pervasive base for a real-world environment in which we can acquire and deliver information everywhere. Advances in technologies such as displays, electronic paper, digital architecture, sensors, RFID tags, and storage devices may bring about a new real-world environment for data access.

Katsumi Tanaka, Yutaka Kidawara, Ki-Joune Li
Distributed SLCA-Based XML Keyword Search by Map-Reduce

Large amounts of XML data come continually from new Web applications, and SLCA (Smallest Lowest Common Ancestor)-based XML keyword search is one of the most important information retrieval approaches. Previous approaches focus on building indexes for XML documents. However, in information dissemination scenarios, it is impossible to build indexes in advance for continuous XML document streams. This paper addresses SLCA-based keyword search over continuous XML documents with the Map-Reduce mechanism. We use parallel algorithms to process large numbers of XML documents in a Hadoop environment. A distributed SLCA computation method is designed in which each node computes SLCAs independently and only a little information needs to be transmitted. A real Hadoop environment is built, and we demonstrate the efficiency of our algorithms analytically and experimentally.
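SLCA computation itself can be sketched with Dewey labels, where the LCA of two nodes is the longest common prefix of their labels. The quadratic candidate generation below is purely illustrative and ignores the paper's distributed Map-Reduce setting; efficient SLCA algorithms avoid the all-pairs step.

```python
def lcp(a, b):
    """Longest common prefix of two Dewey labels (tuples of child
    ranks); this is exactly the lowest common ancestor of the nodes."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return tuple(out)

def slca(list1, list2):
    """Naive SLCA sketch over two Dewey-labelled keyword lists: collect
    the LCA of every cross pair, then keep only the deepest candidates,
    i.e. drop any candidate that is a proper ancestor of another."""
    cands = {lcp(a, b) for a in list1 for b in list2}
    return [c for c in cands
            if not any(d != c and d[:len(c)] == c for d in cands)]

# Nodes containing "XML" and "Hadoop" in a toy document tree:
xml_nodes = [(1, 1, 1), (1, 2)]
hadoop_nodes = [(1, 1, 2), (2, 1)]
print(slca(xml_nodes, hadoop_nodes))  # [(1, 1)]
```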

Chenjing Zhang, Qiang Ma, Xiaoling Wang, Aoying Zhou
FVC: A Feature-Vector-Based Classification for XML Dissemination

With the adoption of XML in a wide range of applications, efficient XML classification has become an important research topic. In current studies, users' interests are expressed by XPath or XQuery queries. However, such a query is hard to formulate, because it requires good knowledge of the structure and contents of the documents that will arrive, plus some knowledge of XQuery, which few consumers have. The query may even be impossible to formulate when distinguishing relevant from irrelevant documents requires the consideration of a large number of features. Traditional classification methods do not work well for XML dissemination, because the number of training examples is often small. Therefore, this paper introduces a data mining approach to XML dissemination that uses a user's document collection to automatically learn a classifier modelling his or her information needs. We present a novel XML classifier that takes into account the structure as well as the content of XML documents. Our experimental evaluation on several real XML document sets demonstrates the accuracy and efficiency of the proposed approach.

Xiaoling Wang, Martin Ester, Weining Qian, Aoying Zhou
An Object-Field Perspective Data Model for Moving Geographic Phenomena

We propose a new data model to represent dynamic and continuous geographic phenomena over a spatiotemporal domain in moving-object databases. Existing data models of moving objects have shortcomings with respect to the representation of moving geographic phenomena involving continuous fields, such as temperature, elevation, and the degree of pollution. Moreover, in the case of a spatiotemporal model for continuous fields, it is difficult to deal with continuous movement through time. In this paper, we define a data type called a moving field to represent both object-based and field-based views of geographic phenomena in spatiotemporal domains. The main feature of our model is that it integrates the spatial field model with the slice representation of moving objects. By introducing moving fields, we provide a new computational environment for analyzing various moving phenomena with numerical as well as geographic processing.

K. -S. Kim, Y. Kiyoki
GRAMS3: An Efficient Framework for XML Structural Similarity Search

Structural similarity search is a fundamental technology for XML data management. However, existing methods do not scale well to large volumes of XML documents. The pq-gram is an efficient way of extracting substructures from tree-structured data for approximate structural similarity search. In this paper, we propose an effective framework, GRAMS3, for evaluating the structural similarity of XML data. First, the pq-grams of an XML document are extracted; then we study the characteristics of the pq-grams and generate a doc-gram vector for the XML tree using a TGF-IGF model; finally, we employ locality-sensitive hashing for efficient structural similarity search over XML documents. An empirical study using both synthetic and real datasets demonstrates that the framework is efficient.
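A sketch of pq-gram profiles and the bag-based pq-gram distance, in the spirit of Augsten et al.'s definition; the parameter choices and toy trees are illustrative, and the TGF-IGF weighting and LSH steps of GRAMS3 are omitted.

```python
from collections import Counter

def pq_grams(node, p=2, q=2, anc=()):
    """pq-gram profile of a tree: each gram is p ancestor labels plus q
    consecutive child labels, padded with '*' dummy labels at the tree
    borders. `node` is a (label, [children]) pair."""
    label, children = node
    stem = (("*",) * p + anc + (label,))[-p:]
    grams = []
    if not children:
        grams.append(stem + ("*",) * q)
    else:
        sibs = ("*",) * (q - 1) + tuple(c[0] for c in children) + ("*",) * (q - 1)
        for i in range(len(sibs) - q + 1):
            grams.append(stem + sibs[i:i + q])
        for c in children:
            grams.extend(pq_grams(c, p, q, anc + (label,)))
    return grams

def pq_distance(t1, t2, p=2, q=2):
    """Bag-based pq-gram distance: 1 - 2|P1 & P2| / (|P1| + |P2|)."""
    P1, P2 = Counter(pq_grams(t1, p, q)), Counter(pq_grams(t2, p, q))
    inter = sum((P1 & P2).values())
    return 1 - 2 * inter / (sum(P1.values()) + sum(P2.values()))

t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
print(pq_distance(t1, t2))  # 0.6
```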

Peisen Yuan, Xiaoling Wang, Chaofeng Sha, Ming Gao, Aoying Zhou
An Asynchronous Message-Based Knowledge Communication in a Ubiquitous Environment

This paper presents the required operational logic for relaying user requests from traditional Server/Client-based systems to services or additional information sources that require asynchronous communications. The paper describes a simple syntax for user-generated messages that can be used to determine where the messages should be forwarded. The paper also explains how replies to these messages should be processed. In the scope of this paper the Knowledge Grid works as the external source of information, but the same principle could be applied to any information source.

Petri Rantanen, Pekka Sillberg, Hannu Jaakkola, Takafumi Nakanishi
Providing Scalable Data Services in Ubiquitous Networks

Topology is a fundamental part of a network that governs connectivity between nodes, the amount of data flow, and the efficiency of data flow between nodes. In traditional networks, due to physical limitations, topology remains static for the course of network operation. Ubiquitous data networks (UDNs), in contrast, are more adaptive and can be reconfigured in response to changes in their topology. This flexibility makes them very appealing and an attractive medium for supporting "anywhere, any place" communication. However, it raises the problem of designing a dynamic topology. The dynamic topology design problem is of particular interest to application service providers who need to provide cost-effective data services on a ubiquitous network. In this paper we describe algorithms that decide when and how the topology should be reconfigured in response to a change in the data communication requirements of the network. In particular, we describe and compare a greedy algorithm, which is often used for topology reconfiguration, with a non-greedy algorithm based on metrical task systems. Experiments show that the algorithm based on metrical task systems has performance comparable to the greedy algorithm at a much lower reconfiguration cost.

Tanu Malik, Raghvendra Prasad, Sanket Patil, Amitabh Chaudhary, Venkat Venkatasubramanian
On-Demand Data Broadcasting for Data Items with Time Constraints on Multiple Broadcast Channels

Data broadcasting is an effective approach to providing information to a large group of clients in ubiquitous environments. Generating a data broadcast schedule that keeps clients' average waiting time short is an important issue. In particular, when the data access pattern is dynamic and data have time constraints, as with traffic and stock information, scheduling the broadcast to fulfill the requests becomes challenging. Since the content of the broadcast is dynamic and request deadlines should be met, such data broadcasting is referred to as on-demand data broadcasting with time constraints. Many related papers have discussed this type of data broadcasting with a single broadcast channel. In this paper, we investigate how to schedule on-demand broadcasts for data with time constraints using multiple broadcast channels and provide two heuristics to schedule the data broadcast. The objective of the proposed heuristics is to minimize the miss rate (the ratio of requests missing their deadlines to all requests) and the latency (the time between the issuing and the termination of a request). The proposed heuristics are examined through extensive simulation experiments, whose results validate that they achieve these objectives.
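One of the simplest deadline-aware policies, earliest-deadline-first over multiple channels, gives a feel for the problem the heuristics address. This sketch is not the paper's heuristics; its tick-based timing, tie-breaking, and miss accounting are assumptions for illustration.

```python
import heapq

def broadcast_schedule(requests, channels=1):
    """Deadline-aware on-demand broadcast sketch (plain EDF): at every
    tick, each free channel broadcasts the pending request with the
    earliest deadline; requests whose deadlines have already passed
    count toward the miss rate."""
    requests = sorted(requests)              # (arrival, deadline, item)
    pending, served, missed, t, i = [], [], 0, 0, 0
    while i < len(requests) or pending:
        while i < len(requests) and requests[i][0] <= t:
            heapq.heappush(pending, requests[i][1:])  # (deadline, item)
            i += 1
        for _ in range(channels):
            while pending and pending[0][0] <= t:     # deadline passed
                heapq.heappop(pending)
                missed += 1
            if pending:
                served.append((t, heapq.heappop(pending)[1]))
        t += 1
    return served, missed

reqs = [(0, 1, "stock"), (0, 1, "weather"), (0, 3, "traffic")]
print(broadcast_schedule(reqs))  # ([(0, 'stock'), (1, 'traffic')], 1)
```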

Ta-Chih Su, Chuan-Ming Liu
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Edited by
Masatoshi Yoshikawa
Xiaofeng Meng
Takayuki Yumoto
Qiang Ma
Lifeng Sun
Chiemi Watanabe
Copyright Year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-14589-6
Print ISBN
978-3-642-14588-9
DOI
https://doi.org/10.1007/978-3-642-14589-6
