
About this book

This book constitutes the refereed proceedings of the 10th International Workshop on Databases in Networked Information Systems, DNIS 2015, held in Aizu-Wakamatsu, Japan, in March 2015.

The 14 revised full papers presented together with 7 invited papers were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on big data analysis, information and knowledge management, business data analytics and visualization, networked information resources, and business data analytics in astronomy and sciences.

Table of Contents


Big Data Analysis

The Big Data Landscape: Hurdles and Opportunities

Big Data provides an opportunity to interrogate some of the deepest scientific mysteries, e.g., how the brain works, and to develop new technologies, like driverless cars, which until very recently were more in the realm of science fiction than reality. However, Big Data as an entity in its own right creates several computational and statistical challenges in algorithm, systems and machine learning design that need to be addressed. In this paper we survey the Big Data landscape and map out the hurdles that must be overcome and the opportunities that can be exploited in this paradigm-shifting phenomenon.
Divyakant Agrawal, Sanjay Chawla

Discovering Chronic-Frequent Patterns in Transactional Databases

This paper investigates the partial periodic behavior of frequent patterns in a transactional database, and introduces a new class of user-interest-based patterns known as chronic-frequent patterns. Informally, a frequent pattern is said to be chronic if it has a sufficient number of cyclic repetitions in a database. The proposed patterns can provide useful information to users in many real-life applications; an example is finding chronic diseases in a medical database. The chronic-frequent patterns satisfy the anti-monotonic property, which makes the pattern mining practicable in real-world applications. The existing pattern-growth techniques that are meant to discover frequent patterns cannot be used for finding chronic-frequent patterns, because the tree structures employed by these techniques capture only the frequency and disregard the periodic behavior of the patterns. We introduce a pattern-growth algorithm that employs an alternative tree structure, called the Chronic-Frequent Pattern tree (CFP-tree), to capture both the frequency and the periodic behavior of the patterns. Experimental results show that the proposed patterns can provide useful information and that our algorithm is efficient.
R. Uday Kiran, Masaru Kitsuregawa
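The chronicity test described in the abstract can be sketched as follows. This is an illustrative brute-force check, not the authors' CFP-tree algorithm; the parameter names (`min_sup`, `max_period`, `min_rec`) and the exact definition of a cyclic repetition are assumptions for the sketch.

```python
# Hypothetical sketch: decide whether a frequent pattern is "chronic" from
# the sorted transaction ids (tids) in which it occurs. We treat a pattern
# as chronic-frequent if it is frequent (enough tids) and recurs in cyclic
# stretches, i.e. runs of occurrences whose gaps never exceed max_period.

def periodic_runs(tids, max_period):
    """Split the sorted tid list into maximal runs with gaps <= max_period."""
    runs, current = [], [tids[0]]
    for prev, cur in zip(tids, tids[1:]):
        if cur - prev <= max_period:
            current.append(cur)
        else:
            runs.append(current)
            current = [cur]
    runs.append(current)
    return runs

def is_chronic_frequent(tids, min_sup, max_period, min_rec):
    """tids: sorted ids of transactions containing the pattern."""
    if len(tids) < min_sup:
        return False
    # a run of k consecutive occurrences contributes k - 1 cyclic repetitions
    recurrence = sum(len(r) - 1 for r in periodic_runs(tids, max_period))
    return recurrence >= min_rec
```

The anti-monotonic property mentioned in the abstract means any superset of a non-chronic pattern occurs in a subset of these tids and can be pruned without this check.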

High Utility Rare Itemset Mining over Transaction Databases

High-Utility Rare Itemset (HURI) mining finds itemsets in a database whose utility is no less than a given minimum utility threshold and whose support is less than a given frequency threshold. Identifying high-utility rare itemsets can support better business decision making by highlighting rare itemsets that yield high profits, so that they can be marketed more aggressively. Some two-phase algorithms have been proposed to mine high-utility rare itemsets: the rare itemsets are generated in the first phase, and the high-utility rare itemsets are extracted from them in the second phase. However, a two-phase solution is inefficient, as the number of rare itemsets is enormous and grows rapidly as the frequency threshold increases. In this paper, we propose an algorithm, namely UP-Rare Growth, which uses the UP-Tree data structure to find high-utility rare itemsets in a transaction database. Instead of finding the rare itemsets explicitly, our algorithm works on the frequency and utility of itemsets together. We also propose a couple of effective strategies to avoid searching non-useful branches of the tree. Extensive experiments show that our proposed algorithm outperforms state-of-the-art algorithms in terms of the number of candidates.
Vikram Goyal, Siddharth Dawar, Ashish Sureka
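The two threshold conditions that define a HURI can be illustrated with a naive one-phase enumeration. This sketch is not the authors' UP-Rare Growth algorithm (it has no tree structure or pruning); the transaction encoding, mapping each item to its utility in that transaction, is an assumption.

```python
from itertools import combinations

# Hypothetical brute-force sketch of HURI filtering: keep itemsets whose
# total utility >= min_util while their support stays below max_sup.
# Each transaction maps items to their utility (e.g. profit x quantity).

def mine_huri(transactions, min_util, max_sup):
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            containing = [t for t in transactions
                          if all(i in t for i in itemset)]
            support = len(containing)
            utility = sum(sum(t[i] for i in itemset) for t in containing)
            if support < max_sup and utility >= min_util:   # rare AND high-utility
                result[itemset] = utility
    return result
```

Note that utility, unlike support, is not anti-monotonic, which is why the paper's pruning strategies matter: this exhaustive search is exponential in the number of items.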

Skyband-Set for Answering Top-k Set Queries of Any Users

Skyline computation fails to answer variant queries that need to analyze not just individual objects of a dataset but also their combinations. Therefore, set skyline queries have attracted considerable research attention in the past few years. In this paper, we propose a novel variant of the set skyline query called the “skyband-set” query. We consider the problem of selecting representative, distinctive objectsets in a numerical database. Let s be the number of objects in each set and n be the total number of objects in the database; the number of objectsets in the database then amounts to C(n, s). We propose an efficient algorithm to compute the skyband-set over these C(n, s) sets, where s varies from 1 to n. We investigate properties of skyband-set query computation and develop pruning strategies to avoid unnecessary objectset enumerations as well as comparisons among them. We conduct a set of experiments to show the effectiveness and efficiency of the proposed algorithm.
Md. Anisuzzaman Siddique, Asif Zaman, Yasuhiko Morimoto
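The dominance relation underlying skyband queries can be sketched for individual points (the paper generalizes it to objectsets, which is not reproduced here). In this minimal sketch, smaller values are assumed better in every dimension.

```python
# Illustrative sketch of k-skyband filtering over points, the building
# block the skyband-set query extends to combinations of objects.
# A point p dominates q when p is no worse in every dimension and
# strictly better in at least one; the k-skyband keeps every point
# dominated by fewer than k others (k = 1 gives the classic skyline).

def dominates(p, q):
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyband(points, k):
    return [p for p in points
            if sum(dominates(q, p) for q in points) < k]
```

This quadratic scan motivates the paper's pruning strategies: with C(n, s) candidate sets, pairwise comparison of all combinations is infeasible without pruning.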

Information and Knowledge Management

Towards an Ontology-Based Generic Pipeline Editor

The pipeline concept is widely used in computer science to represent non-sequential computations, from scientific workflows to streaming transformation languages. While pipelines stand out as a highly visual representation of computation, several pipeline languages lack visual editors of production quality. We propose a method by which a generic pipeline editor can be built, centralizing the features needed to maintain and edit different pipeline languages. To foster adoption, especially in less programming-savvy communities, the proposed visual editor will be web-based. An ontology-based approach is adopted for the description of both the general features of the pipelines and the specific languages to be supported. Concepts, properties and constraints are defined using the Web Ontology Language (OWL), providing grounding in existing standards and extensibility. The work also leverages existing ontologies defined for scientific workflows.
Paolo Bottoni, Miguel Ceriani

Synthetic Evidential Study as Primordial Soup of Conversation

Synthetic evidential study (SES for short) is a novel technology-enhanced methodology that combines theatrical role play and group discussion to help people spin stories by bringing together partial thoughts and evidence. SES not only serves as a methodology for authoring stories and games but also exploits the game framework to help people sustain in-depth learning. In this paper, we present the conceptual framework of SES, a computational platform that supports SES workshops, and advanced technologies for increasing the utility of SES. SES is currently under development. We discuss conceptual issues and technical details to delineate how much of the idea we can implement with our technology and how many challenges are left for future work.
Toyoaki Nishida, Atsushi Nakazawa, Yoshimasa Ohmoto, Christian Nitschke, Yasser Mohammad, Sutasinee Thovuttikul, Divesh Lala, Masakazu Abe, Takashi Ookaki

Understanding Software Provisioning: An Ontological View

In areas involving data-relatedness analysis and big data processing (such as information retrieval and data mining), one of the common ways to test developed algorithms is to work with their software implementations. Deploying software as services is one possible way to provide better access to research algorithms, test collections and third-party components, as well as their easier distribution. When provisioning software to computing clouds, researchers often face difficulties in the process of software deployment. Most research software programs utilize different types of unified interfaces; among them are many desktop command-line console applications, which are unsuitable for execution in networked or distributed environments. This significantly complicates the process of distributing research software via computing clouds. As part of a knowledge-driven approach to provisioning CLI software in clouds, we introduce a novel subject-domain ontology intended to describe and support the processes of software building, configuration and execution. We pay special attention to the process of fixing recoverable build and execution errors automatically. We study how ontologies targeting specific build and runtime environments can be defined by using the software provisioning ontology as a conceptual core. We examine how the proposed ontology can be used to define knowledge-base rules for an expert system that controls the process of provisioning applications to computing clouds and making them accessible as web services.
Evgeny Pyshkin, Andrey Kuznetsov, Vitaly Klyuev

*AIDA: A Language of Big Information Resources

Some features of the *AIDA language and its environment are presented to show a way of preparing well-organized information resources based on an integrated-data architecture that supports searching, understanding and immediate re-use of the resources needed. A project for big information resources of the above-mentioned type is presented, and the relations between users and resource-unit owners within a Global Knowledge Market are briefly considered. Some ideas behind knowledge and experience transfer with permanent re-evaluation of resource-unit values, together with examples of the resource types, are also provided.
Yutaka Watanobe, Nikolay Mirenkov

Business Data Analytics and Visualization

Interactive Tweaking of Text Analytics Dashboards

With the increasing importance of text analytics in all disciplines, e.g., science, business, and social media analytics, it has become important to extract actionable insights from text in a timely manner. Insights from text analytics are conventionally presented to the analyst as visualizations and dashboards. While these insights are intended to be set up as a one-time task and observed in a passive manner, most use cases in the real world require constant tweaking of these dashboards in order to adapt to new data analysis settings. Current systems supporting such analysis have grown from simplistic chains of aggregations to complex pipelines with a range of implicit (or latent) and explicit parametric knobs. The re-execution of such pipelines can be computationally expensive, and the increased query-response time at each step may significantly delay the analysis task. Enabling the analyst to interactively tweak and explore the space allows them to get a better hold on the data and insights. We propose a novel interactive framework that allows social media analysts to tweak text mining dashboards not just during their development stage, but also during the analytics process itself. Our framework leverages opportunities unique to text pipelines to ensure fast response times, allowing for a smooth, rich and usable exploration of the entire analytics space.
Arnab Nandi, Ziqi Huang, Man Cao, Micha Elsner, Lilong Jiang, Srinivasan Parthasarathy, Ramiya Venkatachalam

Topic-Specific YouTube Crawling to Detect Online Radicalization

Online video sharing platforms such as YouTube contain several videos and users promoting hate and extremism. Due to the low barrier to publication and anonymity, YouTube is misused as a platform by some users and communities to post negative videos disseminating hatred against a particular religion, country or person. We formulate the identification of such malicious videos as a search problem and present a focused-crawler-based approach consisting of various components performing several tasks: search strategy or algorithm, node-similarity computation metric, learning from exemplary profiles serving as training data, stopping criterion, node classifier and queue manager. We implement two versions of the focused crawler: best-first search and shark search. We conduct a series of experiments by varying the seed, the number of n-grams in the language-model-based comparer, and the similarity threshold for the classifier, and present the results using standard Information Retrieval metrics such as precision, recall and F-measure. The accuracy of the proposed solution on the sample dataset is 69% and 74% for best-first and shark search respectively. We perform a characterization study (by manual and visual inspection) of the anti-India hate and extremism promoting videos retrieved by the focused crawler, based on terms present in the titles of the videos, YouTube category, average length of videos, content focus and target audience. We present the results of applying Social Network Analysis based measures to extract communities and identify core and influential users.
Swati Agarwal, Ashish Sureka
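The best-first variant of the crawl loop can be sketched with a priority queue. The `score` and `neighbors` callables below are placeholders for the paper's language-model similarity comparer and YouTube link extraction, which are not reproduced; the shark-search variant would additionally propagate decayed parent scores to children.

```python
import heapq

# Hedged sketch of a best-first focused crawler: always expand the
# highest-scoring frontier node, keep nodes whose relevance score passes
# the classifier threshold, and only expand the neighbors of kept nodes.

def best_first_crawl(seeds, score, neighbors, threshold, budget):
    frontier = [(-score(s), s) for s in seeds]   # max-heap via negated scores
    heapq.heapify(frontier)
    visited = set(seeds)
    relevant = []
    while frontier and len(relevant) < budget:   # budget is the stopping criterion
        neg, node = heapq.heappop(frontier)
        if -neg >= threshold:                    # node classifier
            relevant.append(node)
            for n in neighbors(node):            # queue manager: enqueue unseen nodes
                if n not in visited:
                    visited.add(n)
                    heapq.heappush(frontier, (-score(n), n))
    return relevant
```

On a toy graph, `best_first_crawl(['a'], scores.get, graph.get, 0.5, 10)` visits relevant nodes in descending score order and never expands a node rejected by the threshold.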

ATSOT: Adaptive Traffic Signal Using mOTes

This paper presents the design and development of the Adaptive Traffic Signal using mOTes (ATSOT) system for crossroads, which reduces the average waiting time in order to help commuters drive more smoothly and faster. Motes are used in the proposed system to collect and store the data. The paper proposes an adaptive algorithm to select green-light timings for crossroads in a real-time environment using a clustering algorithm for VANETs. Clustering algorithms are used in VANETs to reduce message transfer, increase connectivity and provide secure communication among vehicles. Direction and position of vehicles are used in the literature for clustering; in this paper, the difference in the speed of vehicles is also considered, along with direction, node degree, and position, to create reasonably stable clusters. A mechanism to check the suitability of a cluster initiator is also proposed. The proposed ATSOT system can be used for hassle-free movement of vehicles across crossroads. A prototype of the system has been designed and developed using open-source software tools: MOVE for mobility-model generation, SUMO for traffic simulation, TraCI as the traffic control interface, and Python for client scripting to initiate and control the simulation. Results obtained by simulating the ATSOT approach are compared with both the OAF algorithm for adaptive traffic-signal control and the pre-timed approach, showing efficiency in terms of reduced waiting times at the crossroads. Results are also compared with the pre-timed method for single-lane and multi-lane environments using Webster’s delay function.
Punam Bedi, Vinita Jindal, Heena Dhankani, Ruchika Garg
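The core idea of adapting green time to observed demand can be illustrated with a proportional allocation. This is an illustrative sketch only; the actual ATSOT algorithm additionally builds VANET clusters from vehicle direction, position, node degree and speed difference, none of which is modeled here, and the bounds `g_min`/`g_max` are hypothetical engineering limits.

```python
# Hypothetical sketch: split a crossroad's green-time budget among its
# approaches in proportion to mote-reported queue lengths, clamping each
# phase between minimum and maximum green times.

def allocate_green(queues, cycle, g_min, g_max):
    """queues: vehicles waiting per approach; cycle: total green budget in seconds."""
    total = sum(queues) or 1            # avoid division by zero on empty roads
    raw = [cycle * q / total for q in queues]
    return [max(g_min, min(g_max, round(g))) for g in raw]
```

A pre-timed signal would instead return fixed durations regardless of `queues`, which is exactly the baseline the paper compares against.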

Covariance Structure and Systematic Risk of Market Index Portfolio

A set of 18 stocks, selected from the current components of the Dow Jones Index, for which historical daily closing data quoted on the US market are available for over four decades, is studied. Within this portfolio, we construct a market index with static weights, defined as the relative aggregate trading amounts for each stock. This market portfolio is studied by means of correlation and covariance analysis of the time series of logarithmic returns. Although no measure defined on the correlation/covariance matrices could be found to be a definite precursor of market crashes and bubbles, which thus appear as rather sudden phenomena, there is an increase in the covariance measures for large absolute values of the logarithmic return of the index. This effect is stronger for negative values of the log return, corresponding to the market crash case, during which the first principal component of the covariance matrix tends to describe a larger proportion of the total market volatility. Periods of low volatility in the market can be characterized by a rather significant spread in the relative importance of the first principal component. This finding also holds for the case of a dynamically constructed market index, for which the weights are computed as the coordinates of the first principal component eigenvector using short-term covariance matrices.
Lukáš Pichl
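The central quantity of the analysis, the share of total variance carried by the first principal component of the covariance matrix of log returns, can be computed as sketched below. This is a dependency-free illustration using power iteration, not the paper's pipeline, and assumes a small number of assets.

```python
import math

# Sketch: covariance matrix of multivariate return series, then the
# fraction of total variance explained by the first principal component
# (top eigenvalue / trace), estimated via power iteration.

def covariance(returns):
    """returns: list of observations, each a list of per-asset log returns."""
    n, d = len(returns), len(returns[0])
    mean = [sum(col) / n for col in zip(*returns)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in returns) / (n - 1)
             for j in range(d)] for i in range(d)]

def first_pc_share(cov, iters=200):
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):                       # power iteration toward top eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    top_eig = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return top_eig / sum(cov[i][i] for i in range(d))   # share of the trace
```

A share close to 1 means a single market mode dominates, which is the regime the paper associates with crash periods.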

Networked Information Resources I

Moving from Relational Data Storage to Decentralized Structured Storage System

The utmost requirement of any successful application in today’s environment is to extract the desired piece of information from its Big Data at very high speed. When Big Data is managed via the traditional relational model, access speed is compromised. Moreover, the relational data model is not flexible enough to handle big data use cases that contain a mixture of structured, semi-structured, and unstructured data. Thus, there is a requirement for organizing data beyond the relational model in a manner that facilitates instant, high availability of any type of data. The current research is a step towards moving from relational data storage (PostgreSQL) to a decentralized structured storage system (Cassandra), to meet users’ high-availability demands for any type of data (structured and unstructured) with zero fault tolerance. To reduce the migration cost, the research focuses on reducing the storage requirement by efficiently compressing the source database before moving it to Cassandra.

An experiment has been conducted to explore the effectiveness of migration from a PostgreSQL database to Cassandra. A sample data set varying from 5,000 to 50,000 records has been used to compare the time taken for selection, insertion, deletion, and searching of records in the relational database and in Cassandra. The study found that Cassandra proves to be the better choice for select, insert, and delete operations. Queries involving join operations in a relational database are time-consuming and costly; Cassandra proves to be search-efficient in such cases, as it stores nodes together in alphabetical order and uses a split function.
Upaang Saxena, Shelly Sachdeva, Shivani Batra

Comparing Infrastructure Monitoring with CloudStack Compute Services for Cloud Computing Systems

CloudStack is an open-source IaaS cloud that provides compute, network and storage services to users. Efficient management of the available resources in the cloud is required in order to improve resource utilization and offer predictable performance to customers. To facilitate better quality of service, high availability and good performance, a comprehensive, reliable, centralized and accurate monitoring system is required. For this, data need to be collected from the components of CloudStack and analyzed in an efficient manner. In this paper, we present a detailed list of attributes required for monitoring the infrastructure associated with CloudStack. We identify the processes related to the compute services, and their associated parameters, that need to be monitored. We categorize infrastructure monitoring and list the monitoring parameters for each category. Further, the proposed list is applied to three monitoring software packages that are commonly used for monitoring resources and processes associated with CloudStack. Developers and system administrators can benefit from this list when selecting monitoring software for their system. The list is also useful during the development of new monitoring software for CloudStack, as the functionality to be monitored can be selected from the list.
Aparna Datt, Anita Goel, Suresh Chand Gupta

Efficiency of NoSQL Databases under a Moderate Application Load

The concepts of NoSQL databases have been developed over recent years, and big Internet companies such as Google, Amazon, Yahoo!, and Facebook now use NoSQL databases. Although the primary focus of NoSQL databases is to deal with huge volumes of heterogeneous data, they can also be suited to handling moderate volumes of data, especially if the data are heterogeneous and change frequently. With this in mind, we consider the development and implementation of an application with a moderate volume of heterogeneous data using a NoSQL database, and perform a comparative performance analysis with a relational database system. The experimental evaluations show that NoSQL databases are often also suitable for handling moderate volumes of data.
Mohammad Shamsul Arefin, Khondoker Nazibul Hossain, Yasuhiko Morimoto

Business Data Analytics in Astronomy and Sciences

A Large Sky Survey Project and the Related Big Data Analysis

We explore the frontier of statistical computational cosmology using the large imaging data delivered by the Subaru Hyper Suprime-Cam (HSC) Survey. The large sky survey is led by an international group and will utilize the 8.3-meter Subaru telescope for 300 nights during the period 2014 to 2019. Deep images of a large fraction of the sky, over one thousand square degrees, will be collected. Our objectives here are twofold: we analyse the images of about half a billion galaxies to reconstruct the distribution of cosmic dark matter, and we detect a few hundred supernovae that can be used as distance indicators. Combined, these two datasets will enable us to derive fundamental parameters, the so-called cosmological parameters, and to predict the future evolution of the universe.
Naoki Yoshida

A Photometric Machine-Learning Method to Infer Stellar Metallicity

Following its formation, a star’s metal content is one of the few factors that can significantly alter its evolution. Measurements of stellar metallicity ([Fe/H]) typically require a spectrum, but spectroscopic surveys are limited to a few × 10^6 targets; photometric surveys, on the other hand, have detected > 10^9 stars. I present a new machine-learning method to predict [Fe/H] from photometric colors measured by the Sloan Digital Sky Survey (SDSS). The training set consists of ~120,000 stars with SDSS photometry and reliable [Fe/H] measurements from the SEGUE Stellar Parameters Pipeline (SSPP). For bright stars (g′ ≤ 18 mag) with 4500 K ≤ T_eff ≤ 7000 K, corresponding to those with the most reliable SSPP estimates, I find that the model predicts [Fe/H] values with a root-mean-squared error (RMSE) of ~0.27 dex. The RMSE from this machine-learning method is similar to the scatter in [Fe/H] measurements from low-resolution spectra.
Adam A. Miller
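The regression setup can be sketched in miniature: predict a continuous label ([Fe/H]) from feature vectors (photometric colors) and score with RMSE. The k-nearest-neighbour regressor below is a stand-in, not the paper's model, and the toy data are hypothetical; only the RMSE metric matches the abstract's evaluation.

```python
import math

# Toy sketch of the photometric-metallicity setup: a k-NN regressor over
# color vectors, plus the root-mean-squared-error metric used to report
# ~0.27 dex accuracy in the paper.

def knn_predict(train_X, train_y, x, k=3):
    """Average the labels of the k training points nearest to x."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return sum(train_y[i] for i in nearest[:k]) / k

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```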

Query Languages for Domain Specific Information from PTF Astronomical Repository

The increasing availability of vast astronomical repositories on the cloud has enhanced the importance of query languages for domain-specific information. Widely used keyword-based search engines (such as Google or Yahoo) fail to meet the needs of skilled and semi-skilled users due to irrelevant returns. Domain-specific astronomy query tools (such as Astroquery, CDS Portal, or XML) provide a single entry point to search and access multiple astronomical repositories; however, they lack easy query-composition tools for single-step or multi-stage queries. Based on previous research on domain-specific query language tools, we aim to implement a query language for obtaining domain-specific information from astronomical repositories (such as PTF data).
Yilang Wu, Wanming Chu

Networked Information Resources II

Pariket: Mining Business Process Logs for Root Cause Analysis of Anomalous Incidents

Process mining consists of extracting knowledge and actionable information from event logs recorded by Process-Aware Information Systems (PAIS). PAIS are vulnerable to system failures, malfunctions, and fraudulent or undesirable executions, resulting in anomalous trails and traces. The flexibility of PAIS, which produces a large number of trace variants, together with the large volume of event logs, makes it challenging to identify anomalous executions and determine their root causes. We propose a framework and a multi-step process to identify the root causes of anomalous traces in business process logs. We first transform the event log into a sequential dataset and apply window-based and Markovian techniques to identify anomalies. We then integrate the basic event-log data, consisting of the case ID, timestamp and activity, with contextual data and prepare a dataset consisting of two classes (anomalous and normal). We apply machine-learning techniques such as decision-tree classifiers to extract rules (explaining the root causes) that describe anomalous transactions. We use advanced visualization techniques such as parallel plots to present the data in a format that makes it easy for a process analyst to identify the characteristics of anomalous executions. We conduct a triangulation study to gather multiple pieces of evidence to validate the effectiveness and accuracy of our approach.
Nisha Gupta, Kritika Anand, Ashish Sureka
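The Markovian anomaly-detection step can be sketched as follows: learn first-order transition probabilities from normal traces, then flag any trace containing an improbable transition. This is a hedged illustration of the general technique, not the paper's exact scoring; the threshold semantics are an assumption.

```python
from collections import defaultdict

# Sketch of first-order Markovian anomaly detection over activity traces:
# estimate P(next activity | current activity) from training traces, then
# mark a trace anomalous if any of its transitions falls below a threshold.

def train_markov(traces):
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def is_anomalous(trace, model, threshold):
    for a, b in zip(trace, trace[1:]):
        if model.get(a, {}).get(b, 0.0) < threshold:   # unseen or rare transition
            return True
    return False
```

In the paper's pipeline, the traces flagged here would then be joined with contextual attributes to form the two-class dataset fed to the decision-tree classifier.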

Modeling Personalized Recommendations of Unvisited Tourist Places Using Genetic Algorithms

An immense amount of data containing information about users’ preferences can be shared via the WWW and mobile devices. The pervasiveness of location-acquisition technologies like the Global Positioning System (GPS) has enabled convenient logging of users’ movement histories. GPS logs are a good source from which to extract information about users’ preferences and interests. In this paper, we first aim to discover and learn individual users’ preferences for the various locations they have visited in the past by analyzing and mining their GPS logs. We use the GPS trajectory dataset of 178 users collected by Microsoft Research Asia’s GeoLife project over a period of more than four years. These preferences are then used to predict an individual’s interest in an unvisited location. We propose a novel approach based on a Genetic Algorithm (GA) to model a user’s interest in an unvisited location. The two approaches have been implemented using Java and MATLAB, and the results are compared for evaluation. The recommendation results of the proposed approach are comparable with a matrix-factorization-based approach and show an improvement of approximately 4.1 % in average root mean squared error (RMSE).
Sunita Tiwari, Saroj Kaushik
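The genetic-algorithm machinery can be sketched generically. The chromosome encoding (a weight vector scoring unvisited places from past preferences) and all parameter values below are hypothetical; the paper's fitness function over GeoLife trajectories is not reproduced.

```python
import random

# Minimal GA sketch: selection, one-point crossover and mutation evolving
# a vector that maximizes a caller-supplied fitness function. Seeded for
# reproducibility; all hyperparameters are illustrative defaults.

def evolve(fitness, dim, pop_size=20, gens=50, mut=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]                 # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, dim) if dim > 1 else 1
            child = a[:cut] + b[cut:]                 # one-point crossover
            if rng.random() < mut:                    # point mutation
                child[rng.randrange(dim)] = rng.random()
            children.append(child)
        pop = parents + children                      # elitism: parents survive
    return max(pop, key=fitness)
```

Because parents are carried over each generation, the best individual found so far is never lost, a common design choice when fitness evaluation (here, matching a user's logged preferences) is noisy.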

A Decentralised Approach to Computer Aided Teaching via Interactive Documents

We demonstrate how certain interactive features of learning management systems, i.e., quizzes and certain types of exercises including evaluations and feedback, can be incorporated into LaTeX-generated PDF documents via embedded JavaScript code. The embedded computational abilities of the enhanced PDF documents discussed in this work allow for the repeated creation and presentation of exercises (within one session of use of such a document), if the exercises are designed to contain components that are randomly generated or randomly selected from a limited family of predesigned choices stored within the document. This enables the user/student to use the embedded exercises in the described interactive teaching documents for drill purposes. In addition, the use of such documents as extensions of common learning management systems is discussed.
Lothar M. Schmitt

