2015 | Book

Data Science

Second International Conference, ICDS 2015, Sydney, Australia, August 8-9, 2015, Proceedings

Edited by: Chengqi Zhang, Wei Huang, Yong Shi, Philip S. Yu, Yangyong Zhu, Yingjie Tian, Peng Zhang, Jing He

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book constitutes the refereed proceedings of the Second International Conference on Data Science, ICDS 2015, held in Sydney, Australia, during August 8-9, 2015.

The 19 revised full papers and 5 short papers presented were carefully reviewed and selected from 31 submissions. The papers focus on the following topics: mathematical issues in data science; big data issues and applications; data quality and data preparation; data-driven scientific research; evaluation and measurement in data service; big data mining and knowledge management; case study of data science; social impacts of data science.

Table of Contents

Frontmatter

Design of Personalized News Comments Recommendation System

Abstract

Nowadays people spend a lot of time browsing news on the Internet. News comments, among the most common things people find on a news website, are attracting more attention than before. Like the news itself, comments have a significant impact on people’s decisions and behavior. Readers are often overwhelmed by the sheer number of comments, and valuable comments are drowned out by large amounts of uninteresting ones. This paper presents a multi-dimensional classification system and a personalized recommendation system for news comments, which aim to provide comment classification and personalized recommendation services. With this system, users get a better user experience and a comprehensive view of the news and comments at a lower time cost.

Mingnan Zhou, Ruisheng Shi, Zhaozhen Xu, Yuan He, Yiyi Zhou, Lina Lan

Minimizing the Social Influence from a Topic Modeling Perspective

Abstract

In this paper, we address the problem of minimizing the negative influence of undesirable things in a network by blocking a limited number of nodes, from a topic modeling perspective. When an undesirable thing such as a rumor or an infection emerges in a social network and some users have already been infected, our goal is to minimize the number of ultimately infected users by blocking k nodes outside the infected set. We first employ HDP-LDA and KL divergence to analyze influence and relevance from a topic modeling perspective. Then two topic-aware heuristics based on betweenness and out-degree are proposed for finding approximate solutions to this problem. Using two real networks, we demonstrate experimentally the high performance of the proposed models and learning schemes.

Qipeng Yao, Li Guo
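
Illustration (not taken from the paper): the abstract above uses KL divergence to compare topic distributions. The following is a minimal sketch of that measure, assuming each user or message has already been summarized as a discrete topic distribution. The three example distributions and the `eps` guard are hypothetical and only serve the illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Return D_KL(p || q) in nats for two discrete distributions.

    `eps` is an illustrative guard against division by zero when q has
    a zero entry; it is an assumption, not something the paper specifies.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total

# Hypothetical topic mixtures over three topics.
user_topics = [0.7, 0.2, 0.1]
rumor_topics = [0.6, 0.3, 0.1]
unrelated_topics = [0.1, 0.1, 0.8]

close = kl_divergence(user_topics, rumor_topics)
far = kl_divergence(user_topics, unrelated_topics)
```

A smaller divergence means the two topic mixtures are closer. How the paper turns this into an influence or relevance score is not stated in the abstract.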

A Study on Optimal Policy for Purchase Data Updating in ERP Systems

Abstract

In the age of big data, it is a challenging task for ERP systems to maintain data timeliness over changing data sources. Purchase data is an important kind of dynamic data, and its timeliness directly affects the accuracy of inventory data and purchase plans. Based on the characteristics of a Markov decision process, we design a dynamic programming algorithm to obtain the optimal purchase data updating policy. Its effectiveness is tested by comparing it with traditional fixed-interval policies on real-life enterprise data. The comparison results show that the proposed updating policy outperforms the fixed-interval policies and can be applied by enterprises when updating ERP systems.

Wei Zong, Feng Wu, Zhengrui Jiang, Yi Qu
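
Illustration (not taken from the paper): the abstract above describes a dynamic programming algorithm over a Markov decision process. The sketch below applies value iteration, a standard dynamic programming method for MDPs, to a toy updating problem. The state (periods since the last update), the two actions, and all cost figures are hypothetical and chosen only to make the idea concrete.

```python
def value_iteration(max_staleness=5, update_cost=5.0, stale_cost=2.0,
                    discount=0.9, sweeps=200):
    """Toy MDP: state s = periods since the data was last refreshed.

    "wait"   costs stale_cost * s and moves to min(s + 1, max_staleness).
    "update" costs update_cost and resets the state to 0.
    All numbers are illustrative assumptions.
    """
    states = range(max_staleness + 1)
    value = [0.0] * (max_staleness + 1)

    def action_costs(s, v):
        wait = stale_cost * s + discount * v[min(s + 1, max_staleness)]
        update = update_cost + discount * v[0]
        return {"wait": wait, "update": update}

    # Repeated Bellman backups (expected discounted cost minimization).
    for _ in range(sweeps):
        value = [min(action_costs(s, value).values()) for s in states]

    # Greedy policy with respect to the final value estimates.
    policy = []
    for s in states:
        costs = action_costs(s, value)
        policy.append(min(costs, key=costs.get))
    return value, policy

value, policy = value_iteration()
```

In this toy setting the resulting policy waits while the data is fresh and updates once staleness becomes expensive, in contrast to a fixed-interval rule that ignores the current state.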

Research of Community Partition Based on the Modularity in Signed Network

Abstract

Considering both the topological structure attributes and the signed edge attributes of signed networks, a novel method of signed network community partition is proposed. First, based on the signed attributes, an initial center vertex is selected as the starting vertex of a random walk. Second, the random walk step length L is determined according to metastability theory. Finally, the community partition of the signed network is obtained by maximizing network modularity. Experiments show that this method partitions communities better than existing methods.

Jingfeng Guo, Xiao Chen, Junli Yu, Chaozhi Fan, Miaomiao Liu

LDA Based Event Extraction: Detecting Influenza Epidemics Using Microblog

Abstract

As a major public health concern, influenza epidemics cause tens of millions of respiratory illnesses worldwide each year. With the development of social networks, interaction platforms such as microblogs generate massive amounts of data, providing a faster and more accurate way to predict trends in the spread of influenza, which can help reduce the impact caused by influenza. The problem of predicting influenza epidemics from Chinese microblogs cannot easily be addressed by applying existing approaches and methods, some of which have been used for English documents. Moreover, unlike traditional text, microblog data is large in volume, update velocity, and noise but small in individual text length, which makes traditional deeper semantic analysis methods such as SVM inefficient and prone to over-fitting. To address this problem, we present a deeper semantic analysis of Chinese microblogs using an LDA-based event extraction framework. Our experiment using 332,886 microblogs from south and north China showed that our method extracted more detailed information about the flu and predicted the flu earlier than the official Chinese ILI data.

Jingwei Li, Wayne Huang, Ping Chen

A Friend Recommendation System Using Users’ Information of Total Attributes

Abstract

Social network services, such as Facebook and Twitter in the U.S.A. and RenRen, QQ and Weibo in China, have grown substantially in recent years. Friend recommendation is an important emerging social network service component, which expands networks by actively recommending new potential friends to users. We introduce a new friend recommendation system that uses all of a user’s attribute information and is based on the law of total probability. The proposed method can be easily extended according to the number of user attributes in different social networks. Our experimental results demonstrate the superior performance of the proposed method. In our empirical studies, we observed that the performance of our algorithm is related to the number of a user’s friends. Our findings have important practical applications in social network design and performance.

Zhou Zhang, Yuewen Liu, Wei Ding, Wei Wayne Huang
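
Illustration (not taken from the paper): the abstract above bases its recommendation on the law of total probability, P(A) = Σ_v P(A | B = v) · P(B = v), where the events B = v partition the sample space. A minimal sketch with a single hypothetical attribute (shared city) follows. The attribute, its values, and all probabilities are made up, and the abstract does not say how the paper combines several attributes.

```python
def total_probability(conditionals, priors):
    """P(A) = sum over v of P(A | B=v) * P(B=v) for a partition {B=v}."""
    if abs(sum(priors.values()) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    return sum(conditionals[v] * priors[v] for v in priors)

# Hypothetical numbers: how often a candidate shares the user's city,
# and how likely a friendship is in each case.
p_city = {"same_city": 0.3, "other_city": 0.7}
p_friend_given_city = {"same_city": 0.2, "other_city": 0.02}

p_friend = total_probability(p_friend_given_city, p_city)
```

With these made-up inputs, the overall friendship probability is 0.2 · 0.3 + 0.02 · 0.7 = 0.074.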

Discovering Sequential Rental Patterns by Fleet Tracking

Abstract

As one of the best-known methods for customer analysis, sequential pattern mining generally focuses on customers’ business transactions to discover their behaviors. However, in the real-world rental industry, behaviors are usually linked to other factors related to the actual circumstances of the equipment. Fleet tracking factors, such as location and usage, have been widely considered important features for improving work performance and predicting customer preferences. In this paper, we propose an innovative sequential pattern mining method to discover rental patterns by combining business transactions with fleet tracking factors. A novel sequential pattern mining framework is designed to detect the effective items by utilizing both business transactions and fleet tracking information. Experimental results on real datasets confirm the effectiveness of our approach.

Xinxin Jiang, Xueping Peng, Guodong Long

A Fast Climbing Approach for Diffusion Source Inference in Large Social Networks

Abstract

In this era of information explosion, discovering potentially useful information in social networks and locating its source has become very important. However, for large-scale social networks, the high computational cost is the key difficulty for source locating algorithms. To address this problem, we present a fast method based on climbing algorithms that locates the information source at lower computational cost in large-scale social networks. Experimental results on both generated and real-world data sets show that our algorithm is faster than existing algorithms, since it needs fewer iterations.

Wenyu Zang, Xiao Wang, Qipeng Yao, Li Guo

Satellite Data Science: A Case Study for Smog Disaster Prediction from Multiple Satellite Observations

Abstract

Smog disaster studies of $\text {PM}_{2.5}$ are limited by the lack of monitoring data, especially in developing countries. Satellite observations offer valuable global information about $\text {PM}_{2.5}$ concentrations, but have limited accuracy and completeness. In contrast to satellite domain-driven methods for $\text {PM}_{2.5}$ retrieval, our approach is satellite data-driven. The challenges and proposed solutions discussed here in the context of global-scale $\text {PM}_{2.5}$ estimation include (i) $\text {PM}_{2.5}$ regression from Aerosol Optical Depth (AOD) data; (ii) training a multi-view model for robust performance across multiple satellite measures; and (iii) handling incomplete data in the model without direct imputation of the missing elements. Experimental results on real-world data sets show that the approach significantly outperforms existing approaches.

Ming Wu, Huajun Chen, Jiaoyan Chen

An Algebra Description for Hard Clustering

Abstract

A hard clustering algorithm partitions a data set into several distinct regions. The clustering result offers a characterization of the data distribution based on concentration. At the same time, the cluster structure can be regarded as a representation of knowledge in the form of data. However, because clustering is an unsupervised learning task and there is no overall criterion for evaluating clustering algorithms, different clustering algorithms lead to different results based on different considerations. Given this uncertainty of any single clustering result, this paper uses algebraic tools to obtain a more reasonable cluster structure by combining various hard clustering results. Furthermore, based on the algebraic representation and topological description of clustering, lattice theory and latticized topology can be employed, which allows us to define algebraic operations on clustering results and discuss their topological properties.

Bo Wang, Yong Shi, Zhuofan Yang, Xuchan Ju

Homeomorphism Between Fuzzy Number Space and the Space of Bounded Functions with Same Monotonicity on $[-1,1]$

Abstract

In this paper, based on the fuzzy structured element, we prove that there is a bijection between the fuzzy number space $\varepsilon ^1$ and the space $B[-1, 1]$, which is defined as the set of standard monotonic bounded functions with the same monotonicity on the interval $[-1, 1]$. Furthermore, a new approach based on monotonic bounded functions is proposed to create fuzzy numbers and represent them using the fuzzy structured element. To equip $B[-1, 1]$ with two different metrics, the Hausdorff metric and the $L_p$ metric, both classical functional metrics, are adopted and their topological properties are discussed. In addition, by introducing a fuzzy functional on the space $B[-1, 1]$, we present two new metrics for fuzzy numbers. Finally, based on the proof of the homeomorphism between the fuzzy number space $\varepsilon ^1$ and the space $B[-1, 1]$, we argue that it not only gives a new way to study fuzzy analysis theory but also makes the study of the fuzzy number space easier.

Huadong Wang, Sicong Guo, Yong Shi

Regression-Based Outlier Detection of Sensor Measurements Using Independent Variable Synthesis

Abstract

We present an improved outlier detection method using a regression model. A signal synthesized from the measurements of different sensors is used to estimate the model parameters. Artificial and real datasets are used to verify the proposed method. Preliminary experiments show an improvement over the regression-based outlier detection method.

Chang Mok Park, Jesung Jeon
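
Illustration (not taken from the paper): the abstract above describes regression-based outlier detection. A minimal sketch of the generic technique follows: fit a least-squares line, then flag measurements whose residual exceeds `k` standard deviations. The input `x` stands in for the synthesized independent signal, and the data values and the threshold `k = 2.0` are hypothetical. How the paper synthesizes that signal is not described in the abstract.

```python
import statistics

def fit_line(x, y):
    """Ordinary least-squares fit y ~ intercept + slope * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def regression_outliers(x, y, k=2.0):
    """Return indices whose residual magnitude exceeds k residual std devs."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * sigma]

# Hypothetical sensor readings following y = 2x + 1, with one corrupted value.
x = list(range(10))
y = [2 * xi + 1 for xi in x]
y[5] = 40

flagged = regression_outliers(x, y)
```

Because the outlier itself pulls the fitted line and inflates the residual spread, a plain least-squares fit like this one can miss milder outliers; that weakness is one reason improved variants are studied.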

Supervised Object Boundary Detection Based on Structured Forests

Abstract

Object boundary detection is an interesting and challenging topic in computer vision. Learning and combining local, mid-level and high-level information plays an important role in most recent approaches. However, the characteristics of a specific type of object are rarely exploited. In this paper, we propose a novel supervised machine learning framework for object boundary detection, which makes use of object-specific features such as boundary shape, direction and intensity. In the learning process, structured forest models are employed to tackle the high-dimensional multi-class problem. Various experimental results show that our framework outperforms the competing models on the proposed data set, indicating that our framework is highly effective in modeling boundaries for a specific type of object.

Fan Meng, Zhiquan Qi, Limeng Cui, Zhensong Chen, Yong Shi

Pavement Distress Detection Using Random Decision Forests

Abstract

Pavement distress detection is a key technology for evaluating pavement surfaces and crack severity. However, there are many challenging problems when using pavement distress detection technology for road maintenance, such as the interference of textured surroundings with intensity similar to that of the distresses, intensity inhomogeneity along the distresses, and the requirement of real-time detection in practice. To address these problems, we propose a novel method for pavement distress detection based on random decision forests. By introducing the multi-scale color gradient features commonly used in contour detection, we extend the feature set of traditional distress detection methods and obtain a crack representation with richer information. During training, we apply a subsampling strategy at each node to maintain the diversity of the trees. With this work, we address all three problems mentioned above. In addition, owing to the characteristics of random decision forests, our method is easy to parallelize and able to perform real-time detection. Experimental results show that our approach is faster and more accurate than existing methods.

Limeng Cui, Zhiquan Qi, Zhensong Chen, Fan Meng, Yong Shi

STMM: Semantic and Temporal-Aware Markov Chain Model for Mobility Prediction

Abstract

Information-theoretic measures and probabilistic techniques have been applied successfully to human mobility datasets to show that human mobility is highly predictable, up to an upper bound of 95% prediction accuracy. Motivated by this finding, we propose a novel Semantic and Temporal-aware Mobility Markov chain (STMM) model to predict the anticipated mobility of a target individual. Although human mobility prediction has been studied extensively in recent years, the vast majority of existing studies have focused on predicting the geo-spatial context and, in rare cases, the temporal context of human mobility. We argue that an explicit and comprehensive analysis of the semantic and temporal context of users’ mobility is necessary for a realistic understanding and prediction of mobility. In line with this, our proposed model simultaneously utilizes semantic and temporal features of a target individual’s historical mobility data to predict their mobility, given their current location context (time and semantic tag of the location). We evaluate our approach on a real-world GPS trajectory dataset.

Hamidu Abdel-Fatao, Jiuyong Li, Jixue Liu
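
Illustration (not taken from the paper): the abstract above predicts mobility from the current time and semantic tag of a location. The sketch below is a plain first-order Markov chain over hypothetical `(time_slot, semantic_tag)` states, estimated from transition counts. It is a simplified stand-in for the idea, not the STMM model itself, and the visit history is made up.

```python
from collections import Counter, defaultdict

def fit_transitions(visits):
    """Count transitions between consecutive (time_slot, semantic_tag) states."""
    counts = defaultdict(Counter)
    for current, nxt in zip(visits, visits[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(counts, state):
    """Return (most likely next state, estimated probability), or None if unseen."""
    if state not in counts:
        return None
    nxt, n = counts[state].most_common(1)[0]
    return nxt, n / sum(counts[state].values())

# Hypothetical visit history spanning three days.
visits = [
    ("morning", "home"), ("morning", "work"), ("noon", "restaurant"),
    ("afternoon", "work"), ("evening", "home"),
    ("morning", "home"), ("morning", "work"), ("noon", "restaurant"),
    ("afternoon", "work"), ("evening", "gym"),
    ("morning", "home"), ("morning", "work"), ("noon", "cafe"),
]

model = fit_transitions(visits)
prediction = predict_next(model, ("morning", "work"))
```

Here the state `("morning", "work")` was followed by a restaurant twice and a cafe once, so the sketch predicts `("noon", "restaurant")` with an estimated probability of 2/3.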

3D Model-Based Food Traceability Information Extraction Framework

Abstract

In this paper, we propose a 3D model-based food traceability information extraction method for processing video surveillance data. The proposed method first builds a 3D model of the surveillance area. Then, the video cameras are mapped in the 3D model and the coordinate transform functions from the 2D camera coordinates to the 3D model coordinates are calculated. Next, the object detection method is applied to identify the target which is then mapped into the 3D coordinates so that its 3D trajectory can be generated. Finally, we merge multiple trajectories from different cameras to create the complete traceability information for the target object. According to the experimental results, the proposed method can efficiently extract useful traceability information for a video surveillance system.

Bo Mao, Jing He, Jie Cao, Stephen Bigger, Todor Vasiljevic

A Spark-Based Big Data Platform for Massive Remote Sensing Data Processing

Abstract

With the fast development of remote sensing techniques, the volume of acquired data grows exponentially. This makes processing massive remote sensing data a big challenge. In this paper, an in-memory computing framework is proposed to address this problem. Here, Spark is an open-source distributed computing platform with Hadoop YARN as the resource scheduler and HDFS as the cloud storage system. On the Spark-based platform, data loaded into memory in the first iteration can be reused in subsequent iterations. This mechanism makes Spark much more suitable for running multi-iteration algorithms than MapReduce, which has to load the data in each iteration. The experiments are carried out on massive remote sensing data using a multi-iteration singular value decomposition (SVD) algorithm. The results show that Spark-based SVD achieves significantly shorter computation time than MapReduce, usually by one order of magnitude.

Zhongyi Sun, Fengke Chen, Mingmin Chi, Yangyong Zhu

A Vehicle Routing Problem with Time Windows for Attended Home Distribution

Abstract

Reliable and efficient delivery over the last three miles poses enormous challenges for city logistics. In recent years, the combination of telematics-based big data collection and O2O e-commerce has laid the groundwork for time-dependent vehicle routing, which has become extremely important in home delivery applications. This paper proposes a logistics platform to solve the order fulfillment problem of an on-demand delivery service with large quantities of orders. The problem can be considered a special vehicle routing problem that accounts for the link time and cost between the store and the delivery destinations designated by customers, where customers are associated with time windows and vehicles with capacities. We then propose a Genetic Algorithm (GA) method. Experimental results show that the proposed approach is highly feasible and has great potential for dealing with the present order fulfillment problem.

Yi Qu, Feng Wu, Wei Zong

Active Class Discovery by Querying Pairwise Label Homogeneity

Abstract

Active learning traditionally focuses on labeling the most informative instances for well-defined learning tasks with known class labels, and a labeler is provided to label each queried instance. In an extreme case, the whole active learning task may start without any available information about the task: for instance, no labeled data are available at the initial stage and the labeler is incapable of providing the ground truth for each queried instance. In this paper, we propose an active class discovery method for the case where no randomly labeled instances exist to kick off the learning cycle and the labeler only has weak knowledge, sufficient to answer whether a pair of instances belong to the same class or not. To roughly identify the classes in the data, a Minimum Spanning Tree based query strategy is employed to discover a number of classes from unlabeled data. Experiments and comparisons demonstrate the superior performance of the proposed method for class discovery tasks.

Yifan Fu, Junbin Gao, Xingquan Zhu

Discovering Productive Periodic Frequent Patterns in Transactional Databases

Abstract

Periodic frequent pattern mining is an important data mining task for various kinds of decision making. However, it often produces a large number of periodic frequent patterns, most of which are not useful because their periodicities are due to the random occurrence of uncorrelated items. Such periodic frequent patterns are often detrimental in decision making where correlations between the items of periodic frequent patterns are vital. To enable mining of periodic frequent patterns with correlated items, we apply a correlation test to periodic frequent patterns and introduce the productive periodic frequent patterns as the set of periodic frequent patterns with correlated items. Finally, we develop PPFP, an efficient Productive Periodic Frequent Pattern mining framework. PPFP is efficient, and the productiveness measure removes the periodic frequent patterns with uncorrelated items.

Vincent Mwintieru Nofong

Building Computational Virtual Reality Environment for Anesthesia

Abstract

In this paper, a Computational Virtual Reality Environment for Anesthesia (CVREA) is proposed. Virtual reality, data mining, and machine learning techniques will be explored to develop (1) an immersive and interactive training platform for anaesthetists, which can greatly improve their training and learning performance; and (2) a knowledge learning environment which collects clinical data with greater richness, processes data with more efficacy, and facilitates knowledge discovery in anaesthesiology.

Xinyu Cao, Peng Zhang, Jing He, Guangyan Huang

Study of the Noise Level in the Colour Fundus Images

Abstract

Diabetic retinopathy (DR) causes vision loss as a result of high blood sugar levels damaging the retina. The progression of DR occurs in the foveal avascular zone (FAZ) due to the loss of tiny blood vessels in the capillary network. Owing to the image acquisition process of the fundus camera, colour retinal fundus images suffer from varying contrast and noise. A technique has been implemented to overcome the varying contrast and noise problems in fundus images. The technique is based on the Retinex algorithm combined with the stationary wavelet transform. It has been applied to 36 images from the high resolution fundus (HRF) image database, comprising 18 bad-quality images and 18 good-quality images. The resulting method, RETSWT (RETinex and Stationary Wavelet Transform), uses the stationary wavelet transform as its denoising technique. RETSWT achieved an average PSNR improvement of 2.39 dB on the good-quality images and 2.20 dB on the bad-quality images. The RETSWT image enhancement method potentially reduces the need for the invasive fluorescein angiogram in DR assessment.

Toufique Ahmed Soomro, Junbin Gao
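
Illustration (not taken from the paper): the abstract above reports results as PSNR improvements in dB. A minimal sketch of the standard definition follows, PSNR = 10 · log10(MAX² / MSE), applied to images given as flat lists of 8-bit pixel values. The tiny pixel lists are hypothetical, and the abstract does not say which reference image the paper measures against.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel example: a noisy image and a partly denoised one.
reference = [100, 100, 100, 100]
noisy = [110, 90, 110, 90]      # MSE = 100
denoised = [105, 95, 105, 95]   # MSE = 25

improvement = psnr(reference, denoised) - psnr(reference, noisy)
```

Halving the pixel error cuts the MSE by a factor of four, which corresponds to an improvement of 10 · log10(4), roughly 6.02 dB, in this made-up case.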

Optimal Search Plan Model for Lost Aircraft

Abstract

In this paper, we aim to determine a specific search plan for lost aircraft on the basis of big data applications. First, a neural network model, the self-organizing map (SOM), is used to solve the area classification problem. Then, we solve a maximum flow problem using breadth-first search (BFS) in order to determine the cruise route.

Luyao Zhu, Wenxi Hao, Zhiwei Zhu
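
Illustration (not taken from the paper): the abstract above solves a maximum flow problem with BFS. The best-known BFS-based method is the Edmonds-Karp algorithm, sketched below; whether the paper uses exactly this variant is an assumption. The small example network and its capacities are hypothetical.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along shortest paths found by BFS.

    `capacity` maps each node to a dict {neighbor: capacity}.
    """
    if source == sink:
        raise ValueError("source and sink must differ")
    # Residual graph: copy capacities and add reverse edges with capacity 0.
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in capacity.items():
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual.get(u, {}).items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push flow: reduce forward edges, increase reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical network from "s" to "t".
network = {
    "s": {"a": 3, "b": 2},
    "a": {"b": 1, "t": 2},
    "b": {"t": 3},
}
total = max_flow(network, "s", "t")
```

For this example the maximum flow is 5, which matches the total capacity leaving `s`.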

An Informatics Approach for Smart Evaluation of Water Quality Related Ecosystem Services

Abstract

Understanding the relationship between water quality and ecosystem services valuation requires a broad range of approaches and methods from the domains of environmental science, ecology, physics and mathematics. The fundamental challenge is to decode the association between ‘ecosystem services geography’ and the distribution of water quality in time and space. This demands the acquisition and integration of vast amounts of data from various domains in many formats and types. Here we present our system development concept to support research in this field. We outline a technological approach that harnesses the power of data with scientific analytics and technological advancement in the evolution of a data ecosystem to evaluate water quality. The framework integrates mobile applications and web technology into citizen science, environmental simulation and visualization. We describe a schematic design that links water quality monitoring and technical advances via data collection by citizen scientists and professionals to support ecosystem services evaluation. These would be synthesized into big data analytics to be used for assessing ecosystem services related to water quality. Finally, the paper identifies technical barriers and opportunities, with respect to the big data ecosystem, for valuing water quality in ecosystem services assessment.

Weigang Yan, Mike Hutchins, Steven Loiselle, Charlotte Hall

Credit Risk Evaluation Based on Improved Trust Evaluation Method Under Network Transaction

Abstract

Information asymmetry puts network transactions at risk, and trust is the foundation of network transactions. In a network transaction environment, trust evaluation is important for predicting the credit risk of the trust object. Therefore, on the basis of analyzing the factors that influence trust, we propose an improved trust evaluation model based on the cloud model; further, a credit risk evaluation methodology is proposed based on the trust evaluation model. Taking C2C as an example for a numerical experiment, the results show that the trust evaluation model and credit risk evaluation method proposed in this paper can provide a reasonable evaluation and interpretation of credit risk in network transactions.

Lai Hui, Huang Yumeng, Zhou Zongfang

Backmatter

Title: Data Science
Edited by: Chengqi Zhang, Wei Huang, Yong Shi, Philip S. Yu, Yangyong Zhu, Yingjie Tian, Peng Zhang, Jing He
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-24474-7
Print ISBN: 978-3-319-24473-0
DOI: https://doi.org/10.1007/978-3-319-24474-7
