2017 | Book

Data Mining and Big Data

Second International Conference, DMBD 2017, Fukuoka, Japan, July 27 – August 1, 2017, Proceedings

About this book

This book constitutes the refereed proceedings of the Second International Conference on Data Mining and Big Data, DMBD 2017, held in Fukuoka, Japan, in July/August 2017.
The 53 papers presented in this volume were carefully reviewed and selected from 96 submissions. They were organized in topical sections named: association analysis; clustering; prediction; classification; schedule and sequence analysis; big data; data analysis; data mining; text mining; deep learning; high performance computing; knowledge base and its framework; and fuzzy control.

Table of Contents

Frontmatter

Association Analysis

Frontmatter
A Process for Exploring Employees’ Relationships via Social Network and Sentiment Analysis

The proposed study analyzes and visualizes properties of a social network constructed from the Enron email dataset and utilizes sentiment analysis as an additional source of information to study employees’ relationships in a company. We conclude that when social network analysis is used in conjunction with emotion detection, it becomes possible to identify the positive or negative areas where the company must work to promote a healthy organizational culture and to uncover possible organizational issues in a timely manner.

Jeydels Barahona, Hung-Min Sun
Mining Relationship Between User Purposes and Product Features Towards Purpose-Oriented Recommendation

To help with decision making, online shoppers tend to go through each product’s features and functionality provided by vendors, as well as the reviews written by other users. This is not effective when confronted with an overload of information, especially for beginners with limited experience and knowledge. To address these problems, we propose a framework for purpose-oriented recommendation, which presents a ranking of products suitable for a designated user purpose by identifying, from user reviews, the product features that are important for fulfilling that purpose. As the technical foundation for realizing the framework, we propose several methods to mine relations between user purposes and product features from online reviews. Experimental results employing reviews of digital cameras on Amazon.com show the effectiveness and stability of the proposed methods with acceptable precision and recall.

Sopheaktra Yong, Yasuhito Asano
Finding Top-k Fuzzy Frequent Itemsets from Databases

Frequent itemset mining is an important task in data mining, and fuzzy data mining can describe the mining results more accurately. Nevertheless, the full set of fuzzy frequent itemsets is often redundant for users; a better approach is to show only the top-k results. In this paper, we define the score of a fuzzy frequent itemset and propose the problem of top-k fuzzy frequent itemset mining, which, to the best of our knowledge, has never been studied before. To address this problem, we employ a data structure named TopKFFITree to store a superset of the mining results, which has a significantly smaller size than the set of all fuzzy frequent itemsets. We then present an algorithm named TopK-FFI to build and maintain the data structure. In this algorithm, we employ a method to prune most of the fuzzy frequent itemsets immediately based on the monotonicity of the itemset score. Theoretical analysis and experimental studies over 4 datasets demonstrate that our proposed algorithm efficiently decreases runtime and memory cost, and significantly outperforms the naive algorithm Top-k-FFI-Miner.
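
For intuition, a minimal brute-force sketch of scoring and ranking fuzzy itemsets (this is not the authors' TopK-FFI/TopKFFITree algorithm; the item names, membership degrees and min-based score are illustrative assumptions):

from itertools import combinations

# Hypothetical fuzzy transactions: each item carries a membership degree in [0, 1].
transactions = [
    {"milk": 0.9, "bread": 0.7, "butter": 0.4},
    {"milk": 0.6, "bread": 0.8},
    {"bread": 0.5, "butter": 0.9},
]

def fuzzy_support(itemset):
    # Sum over transactions of the minimum membership of the itemset's items
    # (a common fuzzy support definition, assumed here as the "score").
    return sum(min(t[i] for i in itemset)
               for t in transactions if all(i in t for i in itemset))

items = sorted({i for t in transactions for i in t})
candidates = [frozenset(c) for r in (1, 2, 3) for c in combinations(items, r)]
top_k = sorted(candidates, key=fuzzy_support, reverse=True)[:3]
for s in top_k:
    print(set(s), round(fuzzy_support(s), 2))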

Haifeng Li, Yue Wang, Ning Zhang, Yuejin Zhang
Association Rule Mining in Healthcare Analytics

Big data analytics examines large amounts of data to uncover hidden patterns, correlations and other insights. In this work, a novel association rule mining algorithm is employed to find rules for performing valid prediction. Various traditional association mining algorithms have been studied carefully, and a new mining algorithm, Treap mining, is introduced that remedies the drawbacks of current Association Rule Learning (ARL) algorithms. Treap mining is a dynamic weighted priority model algorithm. Because it works on dynamic priorities, rule creation happens with low time complexity and high accuracy. Compared with other association mining algorithms such as Apriori and Tertius, the Treap algorithm mines the database in O(n log n) time, versus Apriori’s O(eⁿ) and Tertius’s O(n²). A high-precision mining model for post liver transplantation survival prediction was designed using the rules mined by the Treap algorithm, with the United Network for Organ Sharing dataset used for the study. A rule accuracy of 96.71% was obtained with the Treap mining algorithm, whereas Tertius produced 92% and Apriori 80% valid results. The dataset was tested in a dual environment, and significant improvement was noted for the Treap algorithm in both cases.

S. Anand Hareendran, S. S. Vinod Chandra
Does Student’s Diligence to Study Relate to His/Her Academic Performance?

It is often pointed out that students’ academic performance is getting worse. Lack of teaching ability on the professors’ side is often considered the major cause, and universities promote faculty development programs. According to our observation, however, the major cause lies rather on the students’ side, in factors such as lack of motivation, diligence, and other attitudes toward learning. In this paper, we focus on diligence, which is quite important for students to learn effectively. Among various kinds of diligence, we take two into consideration: the length of answer text in a questionnaire and the amount of submitted homework assignments. We investigate how these kinds of diligence relate to each other and how they relate to the examination score.

Toshiro Minami, Yoko Ohura, Kensuke Baba
Correlation Analysis of Diesel Engine Performance Testing Data Based on Mixed-Copula Method

As an important quality index of a diesel engine, engine power has a direct impact on the quality and competitiveness of products. Because the diesel engine performance testing process involves many control parameters that are strongly coupled with one another, it is difficult to carry out an accurate probability distribution description and correlation analysis. This paper presents a correlation analysis method based on a mixed copula model to quantify the correlations between multiple parameters. We begin with an analysis of the engine performance test data, which are non-normal, peaked and fat-tailed. According to the correlation structure of the diesel engine power data, a mixed copula function is constructed using a weighted linear model to describe asymmetric tail behavior, and the expectation-maximization method is used to estimate the related parameters. The results show that the mixed copula function describes the correlation structure and tail characteristics of diesel engine power well.
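
As an illustration (the specific copula families and weights here are assumptions, not necessarily those chosen by the authors), a weighted linear mixture of bivariate copulas with different tail behavior can be written as

  C_{mix}(u, v) = w_1 C_{Gumbel}(u, v; \theta_1) + w_2 C_{Clayton}(u, v; \theta_2) + w_3 C_{Frank}(u, v; \theta_3),  with  w_1 + w_2 + w_3 = 1,  w_i \ge 0,

where the weights w_i and copula parameters \theta_i would be estimated, for example, by the expectation-maximization procedure mentioned in the abstract.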

Zha Dongye, Qin Wei, Zhang Jie, Zhuang Zilong

Clustering

Frontmatter
Comparative Study of Apache Spark MLlib Clustering Algorithms

Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open-source platform for large-scale data processing that is well suited for iterative machine learning tasks. This paper presents an overview of the Apache Spark Machine Learning Library (Spark MLlib) clustering algorithms. The clustering methods, consisting of the Gaussian Mixture Model (GMM), Power Iteration Clustering, Latent Dirichlet Allocation (LDA) and k-means, are described in detail. Three benchmark datasets, Forest Cover Type, KDD Cup 99 and Internet Advertisements, are used for the experiments, and the algorithms that are comparable with one another are compared. For a better understanding of the experimental results, the algorithms are described with suitable tables and graphs.
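
As a minimal usage sketch of the Spark MLlib clustering API surveyed above (the input path and feature columns are placeholders, not the paper's experimental setup):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-clustering-demo").getOrCreate()

# Hypothetical CSV with numeric feature columns f1..f3.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
data = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features").transform(df)

# Fit k-means; GaussianMixture, LDA or PowerIterationClustering can be used analogously.
model = KMeans(k=5, seed=1, featuresCol="features").fit(data)
model.transform(data).select("features", "prediction").show(5)

spark.stop()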

Sasan Harifi, Ebrahim Byagowi, Madjid Khalilian
L-DP: A Hybrid Density Peaks Clustering Method

Density peaks (DP) clustering is a new density-based clustering method. The algorithm can deal with some datasets having non-convex clusters. However, when the shape of the clusters is very complicated, it cannot find the optimal cluster structure; in other words, it cannot discover arbitrarily shaped clusters. To solve this problem, a new hybrid clustering method called L-DP, which combines density peaks clustering with the leader clustering method, is proposed in this paper. Experiments on synthetic datasets show that L-DP handles arbitrarily shaped clusters better than the original DP clustering method. The experimental results on real-world datasets demonstrate that the proposed algorithm is competitive with state-of-the-art clustering algorithms such as DP, AP and DBSCAN.
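
For reference, a minimal sketch of the two quantities at the core of density peaks clustering, the local density rho and the distance delta to the nearest higher-density point (the toy data and cutoff distance are assumptions; this is the plain DP step, not the proposed L-DP hybrid):

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(100, 2)           # toy data set
d = cdist(X, X)                       # pairwise distances
d_c = 0.1                             # cutoff distance (assumed)

rho = (d < d_c).sum(axis=1) - 1       # local density: points within d_c
delta = np.zeros(len(X))
for i in range(len(X)):
    higher = np.where(rho > rho[i])[0]
    delta[i] = d[i, higher].min() if len(higher) else d[i].max()

# Points with both large rho and large delta are candidate cluster centers.
centers = np.argsort(rho * delta)[-3:]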

Mingjing Du, Shifei Ding

Prediction

Frontmatter
Incremental Adaptive Time Series Prediction for Power Demand Forecasting

Accurate power demand forecasts can help power distributors lower the differences between contracted and demanded electricity and minimize the imbalance in the grid and the related costs. Our forecasting method is designed to process a continuous stream of data from smart meters incrementally and to adapt the prediction model to concept drifts in power demand. It identifies drifts using a condition based on an acceptable daily imbalance for the distributor. Using only the most recent data to adapt the model (in contrast to all historical data) and adapting the model only when the need for it is detected (in contrast to creating a whole new model every day) enables the method to handle streaming data. The proposed model shows promising results.

Petra Vrablecová, Viera Rozinajová, Anna Bou Ezzeddine
Food Sales Prediction with Meteorological Data — A Case Study of a Japanese Chain Supermarket

The weather has a strong influence on food retailers’ sales, as it affects customers’ emotional state, drives their purchase decisions, and dictates how much they are willing to spend. In this paper, we introduce a deep learning based method which uses meteorological data to predict the sales of a Japanese chain supermarket. Specifically, our method contains a long short-term memory (LSTM) network and a stacked denoising autoencoder network, both of which are used to learn how sales change with the weather from a large amount of historical data. We show that our method achieves initial success in predicting the sales of some weather-sensitive products such as drinks. In particular, our method outperforms traditional machine learning methods by 19.3%.
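
A minimal sketch of the LSTM part of such a predictor (the window length, feature count and layer sizes are assumptions; the stacked denoising autoencoder used in the paper is omitted):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps, n_features = 14, 5                      # e.g. 14 days of weather and past-sales features
X = np.random.rand(1000, timesteps, n_features)    # placeholder training sequences
y = np.random.rand(1000)                           # placeholder next-day sales

model = Sequential([
    LSTM(64, input_shape=(timesteps, n_features)),
    Dense(1),                                      # predicted sales for the next day
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=32, verbose=0)

next_day_sales = model.predict(X[:1])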

Xin Liu, Ryutaro Ichise
Cascade Spatial Autoregression for Air Pollution Prediction

Recent years have witnessed a growing interest in air quality prediction, and a variety of prediction models have been applied to this task. However, all of these models use only the local attributes of each site for prediction and neglect the spatial context. Indeed, the concentrations of air pollutants follow the first law of geography: everything is related to everything else, but nearby things are more related than distant things. To that end, in this paper we apply the spatial autoregression (SAR) model to air pollution prediction, which considers both local attributes and the predictions from neighboring sites. Specifically, since SAR can only handle a snapshot of spatial data while our input data are time series, we develop the cascade SAR, which takes care of both the spatial and temporal dimensions without incurring extra computation. Finally, the effectiveness of the cascade SAR is validated on the dataset of the London Air Quality Network.
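
For reference, the basic spatial autoregression (SAR) model that the cascade builds on can be written as

  y = \rho W y + X \beta + \varepsilon,

where y stacks the pollutant concentrations at all sites, W is a (row-normalized) spatial weight matrix encoding the neighborhoods, X holds the local attributes, \rho is the spatial autoregression coefficient and \varepsilon is a noise term; the cascading over successive time steps described in the abstract extends this static formulation.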

Yangping Li, Xiaorui Wei, Tianming Hu
Machine Learning Techniques for Prediction of Pre-fetched Objects in Handling Big Data Storage

Large data stores have to serve high volumes of data transactions every day, and user requests for data can cause latency. Therefore, intelligent methods are required to address the insufficient data storage experienced by some providers. Pre-fetching is one of the best techniques for anticipating the data that a user will need in the near future, so that users can access their data at high speed and avoid latency. However, pre-fetching the wrong objects slows down data management performance. In this context, this research proposes Machine Learning (ML) techniques for predicting pre-fetched objects accurately. This paper also compares the Rough Decision Tree (RDT) with other ML techniques including the J48 decision tree, Random Tree (RT), Naïve Bayes (NB), and Rough Set (RS). The experimental results reveal that the proposed RDT performs better than RS alone. However, J48 performs well in classifying the web objects for the IrCache, UTM blog and Proxy Cloud Storage (CS) datasets. Hence, J48 is proposed for implementation in future work on mobile cloud storage services.

Nur Syahela Hussien, Sarina Sulaiman, Siti Mariyam Shamsuddin

Classification

Frontmatter
Spectral-Spatial Mineral Classification of Lunar Surface Using Band Parameters with Active Learning

In the field of remote sensing, the value of the large number of hyperspectral bands for classification is well documented. The collection of labeled samples is a costly affair, and many semi-supervised classification methods have been introduced that can make use of unlabeled samples for training. Due to the nature of these images, high-dimensional spectral features must be distinctive while preserving absorption bands for mineral mapping. We propose a method in which the band parameters of the spectral data are combined with neighborhood spatial information for mineral classification, using active labeling to compensate for the lack of a large number of labeled samples. We demonstrate that by using these parameters for classification in conjunction with their spatial information, higher classification accuracies can be achieved.

Sukanta Roy, Sujai Subbanna, Srinidhii Venkatesh Channagiri, Sharath R Raj, Omkar S.N
R2CEDM: A Rete-Based RFID Complex Event Detection Method

In various RFID application scenarios, RFID continually generates real-time and inherently unreliable raw data, which contain valuable enterprise business events. How to detect valuable events from RFID raw data has become a key issue. A Rete-based RFID complex event detection method called R2CEDM is proposed. First, an α-detecting net is established to check all attributes of RFID primitive events in R2CEDM. Then, a β-detecting net is used to assemble RFID events into RFID complex events according to the business rules, and the complex events of interest are obtained. Comparative experiments show that the proposed R2CEDM can effectively improve processing efficiency.

Xiaoli Peng, Linjiang Zheng, Ting Liao
Learning Analytics for SNS-Integrated Virtual Learning Environment

With the increasing interest in social media usage among students, we are motivated to integrate this informal mode of socialized learning into the formal learning system to engage students in their learning activities. Existing Learning Analytics (LA) has focused only on analyzing formal data obtained from controlled online learning environments, while the social connections and learning experiences of students are not analyzed. The expected output of this work is an SNS-integrated Virtual Learning Environment (VLE) named Shelter which provides formal and informal learning for any subject domain. User testing is conducted, and both informal and formal data are stored in and elicited from Shelter to investigate the impact of this data combination on producing more insightful LA results.

Fang-Fang Chua, Chia-Ying Khor, Su-Cheng Haw
LBP vs. LBP Variance for Texture Classification

Texture classification algorithms are utilised in various image analysis and medical imaging applications. A number of high-performing texture algorithms are based on the concept of local binary patterns (LBP), characterising the relationships of pixels to their local neighbourhood. LBP descriptors are simple to calculate, are invariant to intensity changes, and can be calculated in a rotation-invariant manner as well as at different scales. Incorporating variance information, leading to LBP variance (LBPV) texture descriptors, has been claimed to yield more versatile and more effective texture features. In this paper, we investigate this in more detail, benchmarking and contrasting the classification performance of several LBP and LBPV descriptors for generic image texture classification as well as two medical tasks. We show that while LBPV-based methods typically lead to improved classification performance, this is not always the case, and thus the inclusion of variance information is task-dependent.
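
A minimal sketch of extracting a rotation-invariant uniform LBP histogram with scikit-image (the parameters P and R and the grayscale input are illustrative; the paper's benchmarking pipeline and LBPV variants are not reproduced here):

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_image, P=8, R=1.0):
    # "uniform" yields rotation-invariant uniform patterns taking P + 2 distinct values.
    codes = local_binary_pattern(gray_image, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist

# feature = lbp_histogram(gray_image)   # feed into any classifier, e.g. k-NN or SVM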

Gerald Schaefer, Niraj Doshi
A Decision Tree of Ignition Point for Simple Inflammable Chemical Compounds

The ignition point, the temperature at which a chemical compound begins to burn naturally, is one of the important values from the viewpoint of industry and safety. This manuscript addresses a trial prediction of the ignition point for relatively simple chemical compounds composed of carbon, oxygen and hydrogen via data mining methods such as decision trees and random forests. I used fundamental material values and the numbers of characteristic structures as descriptors for the chemical compounds. The input data file includes 240 chemical compounds, and I prepared another 10 as test data. First, I used the “rpart” package of R, a statistical programming language, to build the decision tree. Furthermore, I used “randomForest” with more data and more descriptors and obtained a better estimation of the ignition point.
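
The study itself works in R with the “rpart” and “randomForest” packages; as a rough analogue only, the same idea sketched in scikit-learn (the descriptor columns and file name are placeholders):

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Hypothetical table of numeric descriptors plus the known ignition point.
data = pd.read_csv("compounds.csv")
X = data.drop(columns=["ignition_point"])
y = data["ignition_point"]

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
# forest.predict(X_test)   # estimate ignition points for the held-out compounds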

Ryoko Hayashi
Analyzing Consumption Behaviors of Pet Owners in a Veterinary Hospital

The purpose of this study is to identify different consumption behaviors of pet owners in a veterinary hospital so as to provide appropriate marketing strategies. A case study was conducted by combining data mining techniques and the RFM model for a veterinary hospital located in Taichung City, Taiwan, examining its 2014 transaction data for pet mice. The development of marketing strategies for the veterinary hospital is important to improve its service quality and strengthen the positive relationship between the pet owners and the case veterinary hospital.

Jo Ting Wei, Shih-Yen Lin, You-Zhen Yang, Hsin-Hung Wu

Schedule and Sequence Analysis

Frontmatter
Modeling Inter-country Connection from Geotagged News Reports: A Time-Series Analysis

The rapid development of big data techniques provides growing opportunities to investigate large-scale events that emerge over space and time. This research utilizes a unique open-access dataset, the Global Database of Events, Language and Tone (GDELT), to model how China has connected to the rest of the world, as well as to predict how this connection may evolve over time based on an autoregressive integrated moving average (ARIMA) model. Methodologically, we examine the effectiveness of traditional time series models in predicting trends in long-term mass media data. Empirically, we identify various types of ARIMA models to depict the connection patterns between China and its top 15 related countries. This study demonstrates the power of applying GDELT and big data analytics to investigate informative patterns for interdisciplinary researchers, and it provides valuable references for interpreting regional patterns and international relations in the age of instant access.
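
A minimal sketch of fitting and forecasting one such series with statsmodels (the file, the monthly event-count series and the (p, d, q) order are placeholders, not the models identified in the study):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly count of China-related GDELT events for one partner country.
series = pd.read_csv("china_partner_events.csv",
                     index_col="month", parse_dates=True)["count"]

model = ARIMA(series, order=(1, 1, 1)).fit()   # order chosen per series in the paper
print(model.forecast(steps=12))                # project the connection one year ahead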

Yihong Yuan
Mining Sequential Patterns of Students’ Access on Learning Management System

Novel pedagogical approaches supported by digital technologies, such as blended learning and the flipped classroom, have become prevalent in recent years. To implement such learning strategies, learning resources are often put online on learning management systems. The log data on those systems provide an excellent opportunity for gaining a better understanding of the students through data mining techniques. In this paper, we propose to use sequential pattern mining (SPM) to discover navigational patterns on a learning platform. We attempt to address the lack of literature on conducting SPM on Moodle. We propose a method to apply SPM that is more appropriate for mining user navigational patterns, and we further propose three sequence modeling strategies for mining patterns with educational implications. Results of a study on a statistics course show the effectiveness of the proposed method and the proposed sequence modeling strategies.

Leonard K. M. Poon, Siu-Cheung Kong, Michael Y. W. Wong, Thomas S. H. Yau
Design of a Dynamic Electricity Trade Scheduler Based on Genetic Algorithms

This paper presents the design and measures the performance of a dynamic electricity trade scheduler employing genetic algorithms for the convenient application of vehicle-to-grid services. Arriving at and being plugged into a microgrid, each electric vehicle specifies its stay time and sales amount, while the scheduler, invoked before each time slot, creates a connection schedule considering the microgrid-side demand and the electricity available from vehicles for the given scheduling window. For the application of genetic operations, each schedule is encoded as an integer-valued vector with the complementary definition of a C-space, which lists in order all combinatorial allocation maps for a task; each integer element then indexes a map entry in its C-space. The performance measurement results, obtained from a prototype implementation, reveal that our scheduler works stably even when the number of sellers exceeds 100 and improves the demand-meet ratio by up to 6.3% compared with the conventional scheduler for the given parameter set.

Junghoon Lee, Gyung-Leen Park
Increasing Coverage of Information Spreading in Social Networks with Supporting Seeding

Campaigns based on information spreading processes within online networks have become a key feature of marketing landscapes. Most research in the field has concentrated on propagation models and on improving seeding strategies as a way to increase coverage. Such research usually assumes selection of the seed set and initialization of the process without any additional support in the following stages. The approach presented in this paper shows how the process initiated by the seed set can be supported by the selection and activation of additional nodes within the network. The relationship between the number of additional activations and the size of the initial seed set depends on the network structure and the propagation parameters, with the highest performance observed for networks with a low average degree and the smallest propagation probability in the chosen model.

Jarosław Jankowski, Radosław Michalski

Big Data

Frontmatter
B-Learning and Big Data: Use in Training an Engineering Course

This paper presents a descriptive, qualitative case study that seeks to evaluate the advantages of deploying and using the b-learning methodology and big data in pedagogical processes. There is a need to evolve the traditional type of education currently practised in the university toward a methodology that allows for greater participation and responsibility on the part of the student and presents an opportunity to develop independent learning skills. The paper first develops a theoretical reference framework covering traditional teaching, b-learning and big data and their application to the field of education. It then approaches the existing problems through a case study employing descriptive records, participant observation and unstructured interviews to analyze and compare the academic performance of students in a course implementing b-learning versus a course taught with traditional methods.

Leonardo Emiro Contreras Bravo, Jose Ignacio Rodriguez Molano, Giovanny Mauricio Tarazona Bermudez
Mapping Knowledge Domain Research in Big Data: From 2006 to 2016

This paper presents a scientometric analysis of research in the emerging field of “Big Data” in recent years. Research on “Big Data” has gained tremendous momentum in a short time, and the topic is now considered one of the most important emerging research areas in computational science and related disciplines. Using the related literature in the Science Citation Index (SCI) database from 2006 to 2016, a scientometric approach was used to quantitatively assess current research hotspots and trends. The analysis shows that “Big Data” is a new, rapidly developing field: a total of 2076 articles covered 131 countries (regions), and the top 3 countries (regions) were the USA (731, 38.86%), China (373, 19.83%) and England (93, 4.94%). In addition, the top 10 keywords found to have citation bursts are: epidemiology, scalability, social media, genomics, visualization, sequencing data, integration, intelligence, association, behavior. The results provide a dynamic view of the evolution of “Big Data” research hotspots and trends from various perspectives, which may serve as a potential guide for future research.

Li Zeng, Zili Li, Tong Wu, Lixin Yang
Template Based Industrial Big Data Information Extraction and Query System

With the rapid development of industry, the amount of data generated by industrial enterprises and industrial business websites is growing exponentially, and the data come in different types. In this paper, we design and implement an industrial big data information acquisition and query system. The system is based on big data acquisition and analysis of industry news data and industrial product data. We use a template-based information acquisition method to crawl data from industry-related news sources and industrial product sources. We also compare the query performance on textual industrial data with a text index and with plain SQL without an index. The system is useful for analyzing hot news and public opinion in the industrial field, and for providing reference and rapid search and comparison of relevant industrial product prices, inventory and other information.

Jie Wang, Yan Peng, Yun Lin, Kang Wang
Finding the Typical Communication Black Hole in Big Data Environment

“Black holes” are widespread in mobile communication data and can significantly degrade mobile service quality. OLAP tools are extensively used for decision-support applications on the multidimensional data model, which fits the mobile communication case well. However, different dimensions of the mobile data are incomparable and thus can hardly be reduced to one unique final value that satisfies all dimensions. We therefore exploit the skyline operator as a post-operation while building data cubes, yielding what we call the data cube of skylines. The skyline of a cuboid is not derivable from another cuboid, and the skyline operation is holistic, which makes this problem even more challenging. In this paper, we propose a method for materializing the cube of skylines over big communication data and prove its effectiveness and efficiency through extensive experiments.
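
For reference, a minimal sketch of the skyline operator itself, i.e. keeping the tuples not dominated in every dimension (a naive pairwise check; the cube materialization strategy of the paper is not shown, and the toy tuples are illustrative):

def dominates(a, b):
    # a dominates b if a is at least as good in every dimension and strictly better in one
    # (here "good" means smaller).
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# toy 3-dimensional quality measurements per cell; the skyline keeps the non-dominated ones
print(skyline([(0.2, 5, 1), (0.1, 7, 2), (0.3, 4, 0), (0.2, 5, 3)]))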

Jinbo Zhang, Shenda Hong, Yiyong Lin, Yanchao Mou, Hongyan Li
A Solution for Mining Big Data Based on Distributed Data Streams and Its Classifying Algorithms

With the advent of the era of big data, the need to discover valuable knowledge from big data has become one of the main focuses. However, big data is characterized by alarming velocity, huge volume and varied data structures, so how to represent big data for effective mining has become a key problem. Aiming at the requirements of big data analysis, this paper constructs the concept of the Distributed Data Stream (DDS) and builds a mining model together with some key algorithms. The experiments show that integrating these algorithms under our model achieves higher mining accuracy on distributed data streams.

Guojun Mao, Jiewei Qiao

Data Analysis

Frontmatter
Design of a Quality-Aware Data Capture System

Data analytics is an ever-growing field which provides insights, predictions and patterns from raw data. The outcome of analytics is greatly affected by the quality of the input data on which the analytics is performed. This paper explores the design of a quality-aware data capture system which uses data mining techniques and algorithms, specifically a decision-tree based approach, for data validation and verification, with the objective of identifying data quality issues at the stage when data enters the system by providing appropriate feedback through a carefully designed user interface.

R. Vasanth Kumar Mehta, Shubham Verma
Data Architecture for the Internet of Things and Industry 4.0

This paper analyzes the Internet of Things (IoT), its use in the manufacturing industry, its foundational principles, and the elements and technologies already developed in this area for man-things-software communication, and it shows how important its deployment is. In such systems, the information processed relates to manufacturing status, trends in energy consumption by machinery, movement of materials, customer orders, supply data and all data related to the smart devices deployed in the processes. The paper describes a proposed data architecture for the Internet of Things applied to industry and an integration metamodel (Internet of Things, social networks, cloud and Industry 4.0) for generating applications for Industry 4.0.

José Ignacio Rodríguez Molano, Leonardo Emiro Contreras Bravo, Eduyn Ramiro López Santana
Machine Learning in Data Lake for Combining Data Silos

Data silos can grow into large-scale data stores over the years, overlapping and of uncertain quality, while allowing an organization to develop its own analytical capabilities. A data lake can solve this problem efficiently through data analysis using statistical and predictive modeling techniques, which can be applied to enhance and support an organization’s business strategy. This study provides an overview of the process of decision making, operational efficiency, and creating solutions for an organization. Machine learning can distribute the data model architecture and integrate a data silo with other organizations’ data to optimize the operational business processes within an organization, improving data quality and efficiency. Testing is done using data from Malaysia’s and Singapore’s government open data on the Air Pollutant Index to determine air pollution levels relevant to the health and safety of the population.

Merlinda Wibowo, Sarina Sulaiman, Siti Mariyam Shamsuddin
Development of 3D Earth Visualization for Taiwan Ocean Environment Demonstration

This paper presents a three-dimensional earth representation system for visualization of Taiwan ocean data, based on mixed MySQL and HBase databases. A client-server architecture is used to provide Web GIS services in this system. In addition, WebGIS and WebGL are adopted to enhance 3D visualization of the ocean data. In order to achieve better data-access performance, an architecture of HBase combined with MySQL is used to accommodate different data types and sizes in the back end of the system. Finally, we integrate the information from sea-going onboard surveys, including concentrations of chlorophyll a, salinity and temperature, shore-based surface current data and PM2.5 data, into the 3D-GIS display platform to support environmental decision making in the future.

Franco Lin, Wen-Yi Chang, Whey-Fone Tsai, Chien-Chou Shih
Exploring Potential Use of Mobile Phone Data Resource to Analyze Inter-regional Travel Patterns in Japan

In Japan, the Inter-Regional Travel Survey gives rich information to researchers and transportation planners. The most recent published survey was conducted in 2010, and the newest survey data collected in 2015 will be available soon. This national survey is mainly based on an on-site questionnaire survey, which requires an enormous budget and much time to finalize and publish the results. Recently, ubiquitous mobile computing and big data have given us new opportunities to explore new types of data resources beyond the traditional survey data. This study clarifies the deviation of mobile phone data at the aggregated origin-destination level of inter-regional trip flows, compared with the traditional on-site passenger survey. In addition, the mechanisms of inter-regional trip generation are explained through travel patterns using a classification tree analysis, one of the classification algorithms used in big data mining.

Canh Xuan Do, Makoto Tsukai
REBUILD: Graph Embedding Based Method for User Social Role Identity on Mobile Communication Network

Inferring users’ social roles on a mobile communication network is significant for various applications, such as financial fraud detection, viral marketing, and targeted promotions. Unlike social networks, which have a lot of user-generated content (UGC) including text, pictures, and videos, a mobile communication network, owing to privacy concerns, contains only communication pattern data, such as message frequency and phone call frequency and duration. Moreover, the profile data of mobile users are always noisy, ambiguous, and sparse, which makes the task more challenging. In this paper, we use graph embedding methods as a feature extractor and then combine them with hand-crafted structure-related features in a feed-forward neural network. Unlike previous embedding methods, we consider the label information while sampling the context. To handle the noise and sparsity challenge, we further project the generated embedding onto a much smaller subspace. Through our experiments, we increase precision by up to 10% even with a huge portion of noisy and sparsely labeled data.

Jinbo Zhang, Yipeng Chen, Shenda Hong, Hongyan Li
Study a Join Query Strategy Over Data Stream Based on Sliding Windows

The sliding window is a common query method over data streams, but traditional approaches can yield inaccurate results. Due to the blocking nature of joins and the unbounded nature of data streams, some constraints must be placed on join operations. To address these constraints, this paper proposes a strategy based on load shedding techniques for sliding window aggregation queries over data streams, for basic query processing and optimization. The new strategy supports multiple streams and multiple queries. Experimental tests show that the new strategy based on load shedding techniques can greatly improve query processing efficiency. Finally, we compare it with other join query strategies over data streams to verify the new method. The sliding-window-based join query strategy over data streams is effective and can significantly reduce query processing time.
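
A minimal sketch of a symmetric join over two streams with time-based sliding windows (the window size and key matching are assumptions, and the paper's load-shedding policy is only indicated by a comment):

from collections import deque

WINDOW = 60.0  # seconds of history kept per stream (assumed)

class SlidingWindowJoin:
    def __init__(self):
        self.left, self.right = deque(), deque()

    def _expire(self, window, now):
        while window and now - window[0][0] > WINDOW:
            window.popleft()           # load shedding could also drop tuples here

    def insert(self, side, ts, key, value):
        own, other = (self.left, self.right) if side == "L" else (self.right, self.left)
        self._expire(own, ts)
        self._expire(other, ts)
        own.append((ts, key, value))
        # probe the opposite window for matching keys
        return [(value, v) for (_, k, v) in other if k == key]

j = SlidingWindowJoin()
j.insert("L", 1.0, "user42", "click")
print(j.insert("R", 2.0, "user42", "purchase"))   # -> [('purchase', 'click')]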

Yang Sun, Lin Teng, Shoulin Yin, Jie Liu, Hang Li

Data Mining

Frontmatter
L2 Learners’ Proficiency Evaluation Using Statistics Based on Relationship Among CEFR Rating Scales

In this paper, aiming at an objective evaluation of second language (L2) learners’ proficiency, an attempt was made to predict learners’ language proficiency using 94 statistics. The statistics were extracted automatically and manually from English conversation data collected from groups of Japanese English learners at educational institutions and were classified into 5 subcategories. To estimate the learners’ English proficiency represented as Common European Framework of Reference (CEFR) Global Scale scores, canonical correlation analysis was performed on the statistics and the 5 subcategories, and their correlations with CEFR Global Scale scores were analyzed. As a result of the analysis, 24 statistics were selected for predicting the learners’ English proficiency. The estimation experiment was carried out using a neural network trained on a dataset of 135 learners and the 24 statistics in cross-validation. An overall correlation of 0.894 was obtained between the predicted proficiency scores and the L2 learners’ actual CEFR Global Scale scores. These results confirm the usefulness of the 24 statistical measures, out of the initial set of 94 measures, for the objective evaluation of L2 language proficiency.

Hajime Tsubaki
Research Hotspots and Trends in Data Mining: From 1993 to 2016

Data mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. This paper explores a bibliometric approach to quantitatively assess current research hotspots and trends in data mining, using the related literature in the Science Citation Index (SCI) database from 1993 to 2016. It shows that data mining research in 2016 was in its mature period with a maturity of 85.55%; a total of 11071 articles covered 131 countries (regions), and the top 3 countries (regions) were the USA (2311, 21.37%), China (1474, 13.63%) and Taiwan (904, 8.36%). In addition, the top 10 keywords found to have citation bursts are: big data, social network, particle swarm optimization, data warehouse, gene expression, self-organizing map, intrusion detection, recommender system, bioinformatics, svm. This study provides scholars with an overview of data mining research, as well as its research hotspots and future research directions.

Zili Li, Li Zeng
Proposal for a New Reduct Method for Decision Tables and an Improved STRIM

Rough set theory is widely used as a method for estimating and/or inducing the knowledge structure of if-then rules from a decision table after a reduct of the table. The concept of a reduct is that of constructing a decision table with the necessary and sufficient condition attributes to induce the rules. This paper briefly summarizes the reduct concept, retests reducts obtained by the conventional methods using simulation datasets, and points out several problems with those methods. A new reduct method based on a statistical viewpoint is then proposed and confirmed to be valid by applying it to the simulation datasets. The new reduct method is incorporated into STRIM (Statistical Test Rule Induction Method) and plays an effective role in rule induction. STRIM including the reduct method is also applied to a UCI dataset and shown to be very useful and effective for estimating the if-then rules hidden behind the decision table of interest.

Jiwei Fei, Tetsuro Saeki, Yuichi Kato
Post-marketing Drug Safety Evaluation Using Data Mining Based on FAERS

Healthcare is going through a big data revolution, and the amount of data generated by healthcare is expected to increase significantly in the coming years. Therefore, efficient and effective data processing methods are required to transform data into information, and applying statistical analysis can transform that information into useful knowledge. We developed a data mining method that can uncover new knowledge in this enormous field for clinical decision making while generating scientific methods and hypotheses. The proposed pipeline can be generally applied to a variety of data mining tasks in medical informatics. For this study, we applied the proposed pipeline to post-marketing surveillance of drug safety using FAERS, the data warehouse created by the FDA. We used 14 kinds of neurology drugs to illustrate our methods. Our results indicate that this approach can successfully reveal insights for further drug safety evaluation.

Rui Duan, Xinyuan Zhang, Jingcheng Du, Jing Huang, Cui Tao, Yong Chen
Crowd Density Estimation from Few Radio-Frequency Tracking Devices: I. A Modelling Framework

This study proposes a modeling framework based on a few radio frequency (RF) tracking devices, such as smartphones or beacons. The proposed framework aims to estimate crowd density continuously where vision analysis is unreliable and the relation between pedestrian speed and density can be at least specified. The crowd density estimated through the modeling framework can be used not only for evacuation command in emergencies, but also for commercial purposes in normal times and for building/facility layout improvement at design time. In the proposed framework, the application level maps input data spaces into feature spaces. The model level applies multiple data models to increase the accuracy of the estimated states. Moreover, the abstract level fuses the heterogeneous parameters estimated at the model level. The models included in the framework are cellular automata models, ferromagnetic models, social force models, and complexity models. The model parameters are estimated by Markov Chain Monte Carlo (MCMC) and particle swarm optimization (PSO) methods. The fusion algorithm factory instantiates a data assimilation approach and a continuous receiver operating characteristic (ROC) estimator.

Yenming J. Chen, Albert Jing-Fuh Yang

Text Mining

Frontmatter
Mining Textual Reviews with Hierarchical Latent Tree Analysis

Collecting feedback from customers is an important task for any business that hopes to retain customers and improve its quality of service. Nowadays, customers can enter reviews on many websites, and the vast number of textual reviews makes it difficult for customers or businesses to read them directly. Topic modeling methods are usually used to analyze such text data. In this paper, we propose to analyze textual reviews using a recently developed topic modeling method called hierarchical latent tree analysis, which has been shown to produce better topic hierarchies than some state-of-the-art topic modeling methods. We test the method using textual reviews written about restaurants on the Yelp website. We show that the topic hierarchy reveals useful insights about the reviews, and we further show how to find interesting topics specific to locations.

Leonard K. M. Poon, Chun Fai Leung, Nevin L. Zhang
Extraction of Temporal Events from Clinical Text Using Semi-supervised Conditional Random Fields

The huge amount of clinical text in Electronic Medical Records (EMRs) has opened a stage for text processing and information extraction for healthcare and medical research. Extracting temporal information from clinical text is much more difficult than from newswire text due to the implicit expression of temporal information, the domain-specific nature of the text and its lack of writing quality, among other factors. Despite these constraints, existing works have established rule-based, machine learning and hybrid methods to extract temporal information with the help of annotated corpora. However, obtaining annotated corpora is costly, time-consuming and requires much manual effort, and thus their small size inevitably affects processing quality. Motivated by this fact, in this work we propose a novel two-stage semi-supervised framework that exploits the abundant unannotated clinical text to automatically detect temporal events and subsequently improve the stability and accuracy of temporal event extraction. We trained and evaluated our semi-supervised model with the selected features on the test dataset, resulting in an F-measure of 89.76% for event extraction.

Gandhimathi Moharasan, Tu-Bao Ho
Characteristics of Language Usage in Inquires Asked to an Online Help Desk

We study how to characterize word usage in inquiries from customers to an online help desk based on statistical properties such as usage counts and correlations among words within inquiries. We also investigate the possibility that such statistical analysis enables us to foresee difficulties in dealing with inquiries.

Haruka Adachi, Mikito Toda

Deep Learning

Frontmatter
A Novel Diagnosis Method for SZ by Deep Neural Networks

Single nucleotide polymorphism (SNP) data are typical high-dimensional, low-sample-size (HDLSS) data, and they are extremely complex. In this paper, by using a deep neural network with a loci filter method, multi-level abstract features of SNP data are obtained. Based on the abstract features, we obtain diagnosis results for schizophrenia. The performance of the deep network is shown to be better than that of other methods, i.e., linear SVM with soft margin, SVM with a multi-layer perceptron kernel, SVM with an RBF kernel, a sparse representation based classifier and the k-nearest neighbor method. These results indicate that the use of deep networks offers a novel approach to dealing with the HDLSS problem, especially for medical data analysis.

Chen Qiao, Yan Shi, Bin Li, Tai An
A Novel Data-Driven Fault Diagnosis Method Based on Deep Learning

Mechanical fault diagnosis is an essential means to reduce maintenance cost and ensure safety in production. Aiming to improve diagnosis accuracy, this paper proposes a novel data-driven diagnosis method based on deep learning. Nonstationary signals are preprocessed, and a feature learning method based on a deep learning model is designed to mine features automatically. The mined features are identified by a supervised classification method, the support vector machine (SVM). Thanks to mining features automatically, the proposed method can overcome the weakness of traditional data-driven diagnosis methods, in which manual feature extraction depends heavily on expertise and prior knowledge. The effectiveness of the proposed method is validated on two datasets, and experimental results demonstrate that the proposed method is superior to traditional data-driven diagnosis methods.

Yuyan Zhang, Liang Gao, Xinyu Li, Peigen Li

High Performance Computing

Frontmatter
Simulated Precipitation and Reservoir Inflow in the Chao Phraya River Basin by Multi-model Ensemble CMIP3 and CMIP5

Climate change caused by global warming is a growing public concern throughout the world. It is well accepted within the scientific community that an ensemble of different projections is required to achieve robust climate change information for a specific region. For this purpose we have compiled a multi-model ensemble and performed statistical downscaling for 9 GCMs of CMIP3 and CMIP5. The observed precipitation data from 83 stations around the country were interpolated to grid data using the Inverse Distance Weighted method. The precipitation projection was downscaled by Distribution Mapping for the near future (2010–2039), the mid future (2040–2069) and the far future (2070–2099). The nonlinear autoregressive neural network with exogenous input (NARX) was used to forecast the mean monthly inflow to reservoirs. The projected inflow for the future periods shows an increase in the wet season, which may indicate a possibility of more extreme hydrological floods in the wet season.

Thannob Aribarg, Seree Supratid
An IDL-Based Parallel Model for Scientific Computations on Multi-core Computers

Parallel computing is an efficient way to improve the efficiency of scientific computations. However, most current parallel methods are implemented on top of the message passing interface and are very complicated for researchers. This study presents a new parallel model based on the Interactive Data Language (IDL). The paper specifically designs and describes the parallel principles, strategies, architectures, and algorithms of the model. An experiment on a time series extraction was conducted to evaluate its performance, and the results illustrate that this model can significantly improve the efficiency of scientific computations. Additionally, the model can be easily extended by third-party modules or toolkits. This study provides a general, upper-layer parallel model for scientists and engineers in the scientific community.

Weili Kou, Lili Wei, Changxian Liang, Ning Lu, Qiuhua Wang
Accelerating Redis with RDMA Over InfiniBand

Redis is an open-source, high-performance in-memory key-value database supporting data persistence. Redis maintains all of the datasets and intermediate results in main memory, using periodic persistence operations to write data to the hard disk and guarantee data persistence. InfiniBand is usually used in high-performance computing domains because of its very high throughput and very low latency. Using RDMA technology over InfiniBand can efficiently improve network communication performance, increasing throughput and reducing network latency while lowering CPU utilization. In this paper, we propose a novel RDMA-based design of Redis, using RDMA technology to accelerate Redis and help it achieve superior performance. The optimized Redis supports not only conventional socket-based network communication but also RDMA-based high-performance network communication. In the high-performance network communication module of the optimized Redis, Redis clients write their requests to the Redis server using RDMA writes over an Unreliable Connection, and the Redis server uses RDMA SEND over an Unreliable Datagram to send responses to the clients. The performance evaluation of our design reveals that when the key size is fixed at 16 bytes and the value size is 3 KB, the average latency of SET operations of RDMA-based Redis is between 53 µs and 56 µs, about two times faster than IPoIB-based Redis. We also present a dynamic Registered Memory Region allocation method to avoid memory waste.

Wenhui Tang, Yutong Lu, Nong Xiao, Fang Liu, Zhiguang Chen

Knowledge Base and its Framework

Frontmatter
Application of Decision Trees in the Development of a Knowledge Base for a System of Support for Revitalization Processes

The objective of the paper was to develop an efficient system supporting the management of degraded areas and their revitalization. The author developed a knowledge base for a system of assessment and support of revitalization processes through the application of selected data mining methods. The database included more than 100 objects for which approximately 100 attributes on different measurement scales were collected. The analysis of the collected data involved the application of decision trees. The intermediate goal was the determination of the applicative potential of properly transformed spatial data for the requirements of revitalization procedures. The studies carried out represent early work in this scientific field, and the author provides a methodological basis for further research.

Agnieszka Turek
ABC Metaheuristic Based Optimized Adaptation Planning Logic for Decision Making Intelligent Agents in Self Adaptive Software System

The potential of machine intelligence is increasing enormously, with a vision of computing systems that can act as good decision-making and self-managing entities. This has led to the introduction of systems that are more intelligent, have self-* properties, and are known as Self-Adaptive Software Systems (SAS). Intelligent agents, which have a high adaptation capability, form the main component of such systems. These self-adaptive systems are given the ability to self-configure based on run-time environmental changes while guaranteeing the overall system functional and QoS goals. This paper proposes an optimized decentralized adaptation logic for modeling SAS which exploits the multi-agent concept. Each subsystem has an objective and uses an Artificial Bee Colony metaheuristic to achieve local optimization, which in turn leads to the optimization of the whole distributed system.

Binu Rajan, Vinod Chandra
A Knowledge-Based Framework for Mitigating Hydro-Meteorological Disasters

A knowledge-based framework for the mitigation of hydro-meteorological disasters in Brunei is proposed, in which a data mining process is used to predict the anomalously intense rainfalls that have been causing destructive floods and landslides. A previous study traced the causes to anomalous oceanographic and atmospheric conditions; this expert knowledge, together with satellite data, is used to create the model. Interoperable collaborative platforms are also crucial. This approach can potentially alter the prevailing disaster management practice, in which reactive response has dominated, toward a proactive bottom-up approach of disaster mitigation based on expert knowledge and prediction from data mining.

Pg. Hj. Asmali Pg. Badarudin, Thien Wan Au, Somnuk Phon-Amnuaisuk

Fuzzy Control

Frontmatter
Improved Stability and Stabilization Criteria for T-S Fuzzy Systems with Distributed Time-Delay

This paper investigates the stability and stabilization analysis problem of nonlinear systems with distributed time-delay. The T-S fuzzy model is employed to describe the nonlinear plant. When designing the fuzzy controller, the novel imperfect premise matching method is adopted, which allows the fuzzy model and the fuzzy controller to use different premise membership functions and different numbers of rules; as a result, greater design flexibility can be obtained. By applying a new, tighter integral inequality which involves information about the double integral of the system states, and by introducing the information of the membership functions, less conservative stability and stabilization conditions are derived. Finally, a numerical example is provided to demonstrate the effectiveness of the proposed approach.

Qianqian Ma, Hongwei Xia, Guangcheng Ma, Yong Xia, Chong Wang
Adaptive Neuro-Fuzzy Inference System: Overview, Strengths, Limitations, and Solutions

The adaptive neuro-fuzzy inference system (ANFIS) is an efficient estimation model not only among neuro-fuzzy systems but also compared with various other machine learning techniques. Despite its acceptance among researchers, ANFIS suffers from limitations that hinder its application to problems with large numbers of inputs, such as the curse of dimensionality and computational expense. Various approaches have been proposed in the literature to overcome such shortcomings; however, considerable room for improvement remains. This paper reports approaches from the literature that reduce computational complexity through architectural modifications as well as efficient training procedures. Moreover, as potential future directions, this paper also proposes conceptual solutions to the limitations highlighted.

Mohd Najib Mohd Salleh, Noureen Talpur, Kashif Hussain
Acquisition of Knowledge in the Form of Fuzzy Rules for Cases Classification

We consider an approach to automatic knowledge acquisition through machine learning based on the integration of two basic paradigms of reasoning: case-based and rule-based reasoning. Case-based reasoning allows the use of high-performance database technology for storing and accumulating cases, while rule-based reasoning is the most developed technology for creating declarative knowledge on the basis of strong logical inference. We also propose an improvement of the classification algorithm through the extraction of fuzzy rules from cases. We obtained higher classification accuracy for various membership functions and for a sequentially reduced training sample by applying special strategies for expanding the scope of fuzzy rules in the control sample classification.

Tatiana Avdeenko, Ekaterina Makarova
Backmatter
Metadata
Title
Data Mining and Big Data
Edited by
Ying Tan
Hideyuki Takagi
Yuhui Shi
Copyright Year
2017
Electronic ISBN
978-3-319-61845-6
Print ISBN
978-3-319-61844-9
DOI
https://doi.org/10.1007/978-3-319-61845-6