
2020 | Book

Data Science

6th International Conference, ICDS 2019, Ningbo, China, May 15–20, 2019, Revised Selected Papers

Edited by: Prof. Jing He, Dr. Philip S. Yu, Yong Shi, Dr. Xingsen Li, Zhijun Xie, Dr. Guangyan Huang, Jie Cao, Dr. Fu Xiao

Publisher: Springer Singapore

Book Series: Communications in Computer and Information Science


About this Book

This book constitutes the refereed proceedings of the 6th International Conference on Data Science, ICDS 2019, held in Ningbo, China, during May 2019.

The 64 revised full papers presented were carefully reviewed and selected from 210 submissions.

The research papers cover the areas of Advancement of Data Science and Smart City Applications, Theory of Data Science, Data Science of People and Health, Web of Data, Data Science of Trust and Internet of Things.

Table of Contents

Frontmatter
Correction to: Data Science

In the originally published version, the affiliation of the editor Xingsen Li on page IV was incorrect. This has been corrected.

Jing He, Philip S. Yu, Yong Shi, Xingsen Li, Zhijun Xie, Guangyan Huang, Jie Cao, Fu Xiao

Advancement of Data Science and Smart City Applications

Frontmatter
Application of Bayesian Belief Networks for Smart City Fire Risk Assessment Using History Statistics and Sensor Data

Fires have become one of the common challenges faced by smart cities. Risk assessment, one of the most efficient approaches in the safety science field, can determine risk quantitatively or qualitatively and recognize threats. Bayesian Belief Networks (BBNs) have gained a reputation as a powerful technique for modeling complex systems whose variables are highly interlinked, and in recent years they have been widely used for quantitative risk assessment in many fields. This work further explores the application of Bayesian Belief Networks to smart city fire risk assessment using historical statistics and sensor data. The dynamic urban fire risk assessment method based on BBNs is described, the factors associated with fire risk are identified, and a BBN model is constructed from them. A case study is then presented to illustrate the calculation model, followed by results and discussion.

Jinlu Sun, Hongqiang Fang, Jiansheng Wu, Ting Sun, Xingchuan Liu
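
To illustrate the kind of inference a BBN performs, the following minimal sketch applies Bayes' rule to a single fire/alarm node pair; the probabilities are hypothetical placeholders, not values from the paper.

```python
# A minimal sketch of Bayesian inference of fire risk from a sensor alarm.
# All probabilities are hypothetical placeholders, not the paper's values.

def posterior_fire_given_alarm(p_fire, p_alarm_given_fire, p_alarm_given_no_fire):
    """Bayes' rule: P(fire | alarm)."""
    p_alarm = (p_alarm_given_fire * p_fire
               + p_alarm_given_no_fire * (1.0 - p_fire))
    return p_alarm_given_fire * p_fire / p_alarm

# Prior from (hypothetical) historical fire statistics, likelihoods from sensor data.
print(posterior_fire_given_alarm(p_fire=0.01,
                                 p_alarm_given_fire=0.95,
                                 p_alarm_given_no_fire=0.05))
# -> ~0.161: a single alarm raises the fire probability from 1% to about 16%.
```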
Scheduling Multi-objective IT Projects and Human Resource Allocation by NSVEPSO

In any information technology enterprise, resource allocation and project scheduling are two important issues for reducing project duration, cost and risk in multi-project environments. This paper proposes an integrated and efficient computational method based on multi-objective particle swarm optimization to solve these two interdependent problems simultaneously. Minimizing project duration and cost while maximizing the quality of resource allocation are all considered in our approach. Moreover, we propose a novel non-dominated sorting vector evaluated particle swarm optimization (NSVEPSO). To improve efficiency, the algorithm first uses a novel method for setting the global best position and then executes a non-dominated sorting process to select the new population. The performance of NSVEPSO is evaluated by comparison with SWTC_NSPSO, VEPSO and NSGA-III. The results of four experiments in a real scenario with small, medium and large data sizes show that NSVEPSO provides better boundary solutions and requires less time than the other algorithms.

Yan Guo, Haolan Zhang, Chaoyi Pang
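
The non-dominated sorting step at the heart of such algorithms can be sketched as follows; this is a generic Pareto-front split for minimization objectives, not the paper's exact NSVEPSO procedure.

```python
# Hedged sketch: the core non-dominated sorting step used by NSVEPSO-style
# algorithms. Details here are generic, not the paper's exact procedure.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Split objective vectors into successive Pareto fronts."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# Objectives: (duration, cost, -quality) for four candidate schedules.
candidates = [(10, 5, -0.8), (12, 4, -0.9), (10, 6, -0.8), (15, 7, -0.5)]
print(non_dominated_sort(candidates))  # the first front holds the best trade-offs
```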
Dockless Bicycle Sharing Simulation Based on Arena

Imbalance in bicycle availability across shared-bicycle sites is very common. When users arrive at a site to rent or return a bicycle, they often find no bicycle to borrow or no space to return one. In response to the problem of unbalanced site demand, most existing research focuses on predicting the demand at bicycle sites. In this study, the use of public bicycles at a site is analyzed from the perspective of simulation. The Arena simulation software is used to build a shared-bicycle operation model with three sites, simulating user arrivals, rides, and bicycle use. Based on the simulation results, the unbalanced sites are identified. For these sites, OptQuest is used to find the best decision plan: by changing the initial number of bicycles at each site, the number of users who cannot rent a bicycle, the surplus of bicycles at a site, and the number of users waiting in the queue are all reduced.

Wang Chunmei
Simplification of 3D City Models Based on K-Means Clustering

With the development of smart cities, 3D city models have expanded from simple visualization to many more applications. However, the data volume of 3D city models is also increasing, which puts great pressure on data storage and visualization. It is therefore necessary to simplify 3D models. In this paper, a three-step simplification method is proposed. First, the geometric features of a building are used to extract its walls and roof separately; the ground plan and the single-layer roof are then extracted by the K-Means clustering algorithm. Finally, the ground plan is extruded to intersect with the roof polygon, forming a simplified three-dimensional city model. Experiments are carried out on a number of 3D city models in CityGML format: the compression ratio of the model data reaches 92.08%, and the simplification results are better than those of comparable methods.

Hui Cheng, Bingchan Li, Bo Mao
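
As a rough illustration of the clustering step, the sketch below groups synthetic roof vertex heights with scikit-learn's K-Means; the actual method operates on full CityGML geometry.

```python
# Hedged sketch: K-Means separating roof structures by height, as a stand-in
# for the roof/ground-plan extraction step. The vertex data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

# z-coordinates of roof vertices from a (hypothetical) CityGML building
roof_z = np.array([3.0, 3.1, 2.9, 6.0, 6.1, 5.9, 6.05]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(roof_z)
for label in range(2):
    layer = roof_z[km.labels_ == label].ravel()
    print(f"roof layer {label}: mean height {layer.mean():.2f} m, {len(layer)} vertices")
```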
Comprehensive Evaluation Model on New Product Introduction of Convenience Stores Based on Multidimensional Data

The introduction of new products is one of the important means of keeping convenience stores vital. Convenience stores introduce hundreds of new products every month, so it is crucial for them to make decisions quickly and conveniently. To advise convenience stores, a comprehensive evaluation model for new product introduction based on multidimensional data is proposed. First, based on the theories of multidimensional data and the snowflake schema, a snowflake model designed for new product introduction in convenience stores is established, covering four dimensions: supplier, product, consumer and competitors. Second, according to the constructed snowflake model and the theory of comprehensive evaluation, a subjective weighting method is applied and 23 indicators are established, allowing decision-makers to evaluate new products by comparing them with existing products. Decision-makers can then use the resulting formula to score new products and judge from the score whether they should be introduced. Finally, to test the operability of the model, Chu Orange in Anda convenience stores was analyzed, leading to the conclusion that Chu Orange should be introduced. The subsequent increase in sales illustrates the validity of the proposed model.

Hongjuan Li, Ding Ding, Jingbo Zhang
Lane Marking Detection Algorithm Based on High-Precision Map

Under sharp changes in road illumination, bad weather such as rain, snow or fog, worn or missing lane markings, reflective water stains on the road surface, shadows cast by trees, or lane markings mixed with other signs, traditional lane marking detection algorithms suffer from missed or wrong detections. In this paper, a lane marking detection algorithm based on a high-precision map is proposed. The basic principle of the algorithm is to combine centimeter-level high-precision positioning with high-precision map data to detect lane markings. The experimental results show that the algorithm has a lower false detection rate under bad road conditions and is robust.

Haichang Yao, Chen Chen, Shangdong Liu, Kui Li, Yimu Ji, Ruchuan Wang
Measurement Methodology for Empirical Study on Pedestrian Flow

This paper reviews the measurement methodology of empirical studies on pedestrian flow. Three concerns in the analysis of pedestrian dynamics are discussed: self-organized behaviors, fundamental diagrams and crowd anomaly detection. Researchers have put forward various measurements that enrich our knowledge of pedestrian walking characteristics, but effective measurement methodology still needs to be developed to understand and interpret individual and mass crowd behaviors.

Liping Lian, Jiansheng Wu, Tinghui Qin, Jinhui Hu, Chenyang Yan
Discovering Traffic Anomaly Propagation in Urban Space Using Traffic Change Peaks

Discovering traffic anomaly propagation enables a thorough understanding of traffic anomalies and dynamics. Existing methods, such as STOTree, are not accurate for two reasons. First, they discover the propagation pattern from the detected anomalies, and the imperfection of the detection method itself may introduce false anomalies and miss real ones. Second, they develop a propagation tree of anomalies by searching continuous spatial and temporal neighborhoods rather than taking a global perspective, and thus cannot find a complete propagation tree if a spatial or temporal gap exists. In this paper, we propose a novel method for discovering traffic anomaly propagation using traffic change peaks, which can visualize the change of traffic anomalies (e.g., congestion and evacuation areas) and thus accurately captures traffic anomaly propagation. Inspired by image processing techniques, the GPS trajectory dataset in each time period is converted into one grid traffic image stored in a grid density matrix, in which grid cells correspond to pixels and the density of a grid cell corresponds to the gray level (0-255) of a pixel. An adaptive filter is developed to generate traffic change graphs from grid traffic images of consecutive periods, and traffic change peaks are clustered along the road to discover the propagation of traffic anomalies. The effectiveness of the proposed method has been demonstrated using a real-world GPS trajectory dataset.

Guang-Li Huang, Yimu Ji, Shangdong Liu, Roozbeh Zarei
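
The conversion from trajectories to a "grid traffic image" can be sketched with a 2D histogram; the coordinates below are synthetic, and the adaptive filter is omitted.

```python
# Hedged sketch of the "grid traffic image" idea: GPS points are binned into a
# grid and densities rescaled to 0-255 gray levels. Coordinates are synthetic.
import numpy as np

rng = np.random.default_rng(0)
lon = rng.uniform(121.5, 121.6, 5000)   # hypothetical trajectory points
lat = rng.uniform(29.8, 29.9, 5000)

density, _, _ = np.histogram2d(lon, lat, bins=64,
                               range=[[121.5, 121.6], [29.8, 29.9]])
gray = np.uint8(255 * density / density.max())   # grid cell -> pixel gray level

# Subtracting consecutive-period images yields a traffic change graph;
# peaks in the difference mark candidate anomaly propagation points.
```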
Forecasting on Electricity Consumption of Tourism Industry in Changli County

In recent years tourism has become more popular, and analyzing electricity consumption in the tourism industry contributes to its development. To predict energy consumption, this paper applies a new model, the NEWARMA model, which adds a variable's own medium- and long-term cyclical fluctuation terms to the basic ARMA model and thereby significantly improves prediction accuracy. This paper also compares the fit of NEWARMA with neural network models and grey models, and finds that it performs better. Finally, through simulation analysis, this study finds that when electricity consumption in one industry declines, other industries may be affected and change too, which can help control total energy consumption in society.

Zili Huang, Zhengze Li, Yongcheng Zhang, Kun Guo
Application of Power Big Data in Targeted Poverty Alleviation—Taking Poverty Counties in Jiangxi Province as an Example

Targeted poverty alleviation is an important measure to promote China's all-round development, but traditional economic surveys and statistics are limited by multiple factors, making it difficult to identify poor targets accurately and in a timely manner. The development of power big data makes it possible to use energy consumption data to locate and identify poor areas. This article therefore takes Jiangxi Province as an example and analyzes 23 regions that have been classified as poverty-stricken counties (8 of which have since been removed from the list). First, panel data regression is performed to show that electricity sales can be used to analyze and predict regional economic development. Then, using the decision tree ID3 algorithm and four neural network algorithms to classify poor and non-poor counties, it is found that the ID3 algorithm has good fit and prediction accuracy. Power big data can therefore be applied to targeted poverty alleviation, and has good prospects.

Jing Mengtong, Liu Kefan, Huang Zili, Guo Kun
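
A minimal stand-in for the classification step, using scikit-learn's entropy-criterion decision tree (an approximation of ID3) on synthetic county-level features:

```python
# Hedged sketch: an entropy-criterion decision tree (sklearn's CART with
# information gain) as a stand-in for ID3. Features and labels are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Hypothetical features per county: [electricity sales growth, per-capita usage]
X = rng.normal(size=(23, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = non-poor, 0 = poor (toy rule)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(clf.predict(X[:5]), clf.score(X, y))
```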
Algorithms Research of the Illegal Gas Station Discovery Based on Vehicle Trajectory Data

As the number of motor vehicles increases, the demand for gas stations rises. Because of the rising profits of gas stations, many traders have built illegal ones, whose dangers are enormous. The government has traditionally screened for illegal gas stations manually, so quickly and effectively mining illegal gas stations from trajectory data has become a problem. This paper proposes a clustering-based discovery algorithm for illegal gas stations that works on unlabeled trajectory data. The algorithm mines the suspected fueling point set and the frequent staying point set of a single vehicle; the difference between the two yields the suspected illegal gas station points in that vehicle's trajectory. Finally, the suspected points from all vehicles of the same type are clustered to find the illegal gas stations.

Shaobin Lu, Guilin Li
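
The set-difference-then-cluster idea can be sketched as follows, with synthetic coordinates and DBSCAN standing in for the clustering step:

```python
# Hedged sketch of the two-set idea: suspected fueling stops minus frequent
# staying points, then clustering the residue across vehicles with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

def suspect_points(fueling_stops, frequent_stays, min_sep=0.001):
    """Keep fueling stops that are not near any frequent staying point."""
    keep = []
    for p in fueling_stops:
        if all(np.linalg.norm(p - q) > min_sep for q in frequent_stays):
            keep.append(p)
    return np.array(keep)

# Synthetic (lon, lat) stops pooled over same-type vehicles
stops = np.array([[121.51, 29.81], [121.512, 29.811], [121.58, 29.85]])
stays = np.array([[121.58, 29.85]])           # e.g. a legal depot
suspects = suspect_points(stops, stays)

labels = DBSCAN(eps=0.005, min_samples=2).fit_predict(suspects)
print(suspects, labels)   # a dense cluster marks a candidate illegal station
```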
Improving Investment Return Through Analyzing and Mining Sales Data

Improving the return on Research and Development (R&D) investment is vital to the survival of enterprises. Nowadays, the increasing demand for customized video surveillance products has accelerated data analysis of the commonality of sales data and placed focus on the main business, which is of great significance for companies' development. However, identifying and developing innovative products is a major difficulty in most R&D and manufacturing companies. This paper applies K-means and K-modes clustering to identify the most important factors impacting sales data, and forecasts the trend of video surveillance products with Decision Tree classification, based on a real video surveillance company, Zhejiang U technologies Company Limited (abbrev. U). Through this work we can improve the quality of decision-making before R&D projects start and effectively manage product development as an investment.

Xinzhe Lu, Haolan Zhang, Ke Huang

Theory of Data Science

Frontmatter
Study on Production Possibility Frontier Under Different Production Function Assumptions

An increasing marginal rate of transformation (MRT) in production is a generally accepted economic presumption, so how can increasing MRT be reflected in linear and non-linear production functions? In Leontief's input-output model, production is assumed to use inputs in fixed proportions, which usually leads to a constant MRT. This paper analyzes the linear output possibility frontier under the Leontief production function in detail, and the authors address this problem by considering non-unique primary inputs in a simplified two-sector economy. We then discuss the production possibility frontier under nonlinear production functions. First, when there are constraints from many (heterogeneous) primary inputs, the net output possibility frontier of the Leontief production function can be curved so that it meets the assumption of increasing MRT. Second, in the intertemporal production process, a total output possibility frontier with increasing MRT can be obtained under a general production function without the assumption of non-unique primary inputs.

Weimin Jiang, Jin Fan
The List 2-Distance Coloring of Sparse Graphs

The k-2-distance coloring of a graph G is a mapping $c:V(G)\rightarrow \{1,2,\cdots ,k\}$ such that for every pair $u,v\in V(G)$ satisfying $0<d_{G}(u,v)\le 2$, $c(u)\ne c(v)$. A graph G is list 2-distance k-colorable if any list L of admissible colors on V(G) of size k allows a 2-distance coloring $\varphi$ such that $\varphi (v)\in L(v)$. The least k for which G is list 2-distance k-colorable is denoted by $\chi _{2}^{l}(G)$. In this paper, we prove that if a graph G has maximum average degree $mad(G)<2+\frac{9}{10}$ and $\Delta (G)=6$, then $\chi _{2}^{l}(G)\le 12$; and if G has $mad(G)<2+\frac{4}{5}$ (resp. $mad(G)<2+\frac{17}{20}$) and $\Delta (G)=7$, then $\chi _{2}^{l}(G)\le 11$ (resp. $\chi _{2}^{l}(G)\le 12$).

Yue Wang, Tao Pan, Lei Sun
Multilingual Knowledge Graph Embeddings with Neural Networks

Multilingual knowledge graphs constructed by cross-lingual knowledge alignment have attracted increasing attention in knowledge-driven cross-lingual research fields. Although many existing knowledge alignment methods based on linear transformations, such as MTransE, perform well on cross-lingual knowledge alignment, we note that neural networks have a stronger nonlinear capacity for capturing alignment features. This paper proposes a knowledge alignment neural network named KANN for multilingual knowledge graphs. KANN combines a monolingual neural network that encodes the knowledge graph of each language into a separate embedding space with an alignment neural network that provides transitions between the cross-lingual embedding spaces. We empirically evaluate the KANN model on the cross-lingual entity alignment task. Experimental results show that our method achieves significant and consistent performance and outperforms the current state-of-the-art models.

Qiannan Zhu, Xiaofei Zhou, Yuwen Wu, Ping Liu, Li Guo
Sparse Optimization Based on Non-convex Regularization for Deep Neural Networks

With the arrival of big data and the improvement of computer hardware, deep neural networks (DNNs) have achieved unprecedented success in many fields. Although deep neural networks have good expressive ability, their large numbers of parameters place a great burden on storage and computation, a problem that remains to be solved. This problem hinders the development and application of DNNs, so it is worthwhile to compress models to reduce their complexity. Sparsifying neural networks is one method of effectively reducing complexity, which can improve efficiency and generalizability. To compress the model, we use a regularization method to sparsify the weights of the deep neural network. Since non-convex penalty terms often perform well in regularization, we choose a non-convex regularizer to remove redundant weights while avoiding weakening the expressive ability, by not removing neurons. We borrow the strength of stochastic methods to solve the structural risk minimization problem. Experiments show that the regularization term features prominently in sparsity and the stochastic algorithm performs well.

Anda Tang, Rongrong Ma, Jianyu Miao, Lingfeng Niu
LSR-Forest: An LSH-Based Approximate k-Nearest Neighbor Query Algorithm on High-Dimensional Uncertain Data

Uncertain data is widely used in many practical applications, such as data cleaning, location-based services and privacy protection. With the development of technology, data tend toward high dimensionality. The most common indexes for nearest neighbor search on uncertain data are the R-Tree and the KD-Tree, which inevitably suffer from the "curse of dimensionality". Focusing on this problem, this paper proposes a new hash algorithm, called the LSR-forest, based on locality-sensitive hashing and the R-Tree, to solve the approximate nearest neighbor search problem on high-dimensional uncertain data. The LSR-forest hashes similar high-dimensional uncertain data into the same bucket with high probability, and then constructs multiple R-Tree-based indexes for the hashed buckets. When querying, neighbors can be judged by checking the data in the hypercube that contains the query point. One can also adjust the query range automatically through the parameter k. Experiments on different data sets show that the LSR-forest is more effective and efficient than the R-Tree on high-dimensional datasets.

Jiagang Wang, Tu Qian, Anbang Yang, Hui Wang, Jiangbo Qian
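
The hashing stage can be illustrated with random-hyperplane LSH, one common locality-sensitive family; the per-bucket R-Tree construction is omitted.

```python
# Hedged sketch of the hashing stage: random-hyperplane LSH maps similar
# high-dimensional points into the same bucket with high probability; an
# R-Tree would then be built per bucket (omitted here).
import numpy as np

rng = np.random.default_rng(42)
dim, n_planes = 64, 8
planes = rng.normal(size=(n_planes, dim))       # one random hyperplane per bit

def lsh_key(x):
    """8-bit bucket key: the sign pattern of x against the hyperplanes."""
    return tuple((planes @ x > 0).astype(int))

x = rng.normal(size=dim)
y = x + 0.01 * rng.normal(size=dim)             # a near neighbor
z = rng.normal(size=dim)                        # an unrelated point
print(lsh_key(x) == lsh_key(y), lsh_key(x) == lsh_key(z))
# near neighbors very likely collide; unrelated points rarely do
```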
Fuzzy Association Rule Mining Algorithm Based on Load Classifier

In this paper, a fuzzy association algorithm based on a load classifier is proposed to study the fuzzy association rules of numerical data streams. A method of dynamically partitioning data blocks with a load classifier is proposed for the data stream, along with an optimized membership function, and the FP-Growth algorithm is used to parallelize the processing of fuzzy association rules. First, based on the load-balancing classifier, a variable window is proposed to divide the original data stream. Second, the continuous data are preprocessed and converted into fuzzy interval data by the improved membership function. Finally, in simulation experiments with the load classifier, the data processing times of the four compared algorithms are similar after convergence, and the data processing time of SDBA (Spark Dynamic Block Adjustment) stays below 25 ms.

Jing Chen, Hui Zheng, Peng Li, Zhenjiang Zhang, Huawei Li, Wei Liu
An Extension Preprocessing Model for Multi-Criteria Decision Making Based on Basic-Elements Theory

The goals and criteria of multiple criteria decision-making (MCDM) problems often contradict one another, and compromise or satisficing solutions usually cannot meet practical needs well. We find that the problem lies in the assumption that goals and constraints are fixed and reasonable, whereas in practice they are extendable. We present an extension preprocessing model for MCDM based on basic-elements theory. Several preprocessing steps for MCDM are introduced to extend the constraints, criteria or goals and obtain win-win solutions through implication analysis and transformations. This gives a way of exploring win-win solutions by extending the multi-directional information and knowledge of the constraints or goals, supported by data mining and Extenics.

Xingsen Li, Siyuan Chen, Renhu Liu, Haolan Zhang, Wei Deng
A New Model for Predicting Node Type Based on Deep Learning

With the development of the Internet, a large number of data sets are generated, which contain valuable resources. Meanwhile, much graph-structured data arises in real life, such as social networks, citation networks, and user networks. User networks also carry rich information about entities beyond the network structure. Predicting the type of nodes in a network can therefore help us quickly identify user types, citation types, etc. In this paper, a new method based on deep learning is proposed to predict the class of a node, with two public data sets used as training sets. First, the node features are embedded to pre-train the neighborhood structure features; the pre-trained data are then fed into the classification model and the structural feature parameters are loaded. The final results show that the prediction accuracy is nearly 25% higher than the baseline model, and the F1 scores of the model on the two data sets are 83.5% and 80.2%, respectively.

Bo Gong, Daji Ergu, Kuiyi Liu, Ying Cai
Flexible Shapelets Discovery for Time Series Classification

Time series classification is important due to its pervasive applications, especially for emerging smart city applications driven by intelligent sensors. Shapelets are subsequences of time series with highly predictive abilities, and time series represented by shapelets better reveal their patterns and thus yield better classification accuracy. Finding shapelets is challenging due to its computational infeasibility: most existing methods only find shapelets of a single length or a few fixed lengths, because the search space of shapelets with arbitrary lengths is too large. In this paper, we improve time series classification accuracy by discovering shapelets of arbitrary lengths. We borrow the idea of the Apriori algorithm from association rule learning: the superset candidates of a poorly predictive shapelet candidate also have poor predictive abilities. We therefore propose a Flexible Shapelets Discovery (FSD) algorithm that discovers shapelets of varying lengths. FSD first discovers shapelet candidates of a lower-bound length and then extends them to arbitrary lengths as long as their predictive abilities increase. Experiments conducted on 6 UCR time series datasets demonstrate that the arbitrary-length shapelets discovered by FSD achieve better classification accuracy than fixed-length shapelets.

Borui Cai, Guangyan Huang, Maia Angelova Turkedjieva, Yong Xiang, Chi-Hung Chi
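
The primitive underlying shapelet discovery, the minimum sliding-window distance from a candidate to a series, can be sketched as follows; FSD's extension loop is omitted.

```python
# Hedged sketch of the basic primitive behind shapelet discovery: the distance
# from a shapelet to a series is the minimum distance over all sliding windows.
# Extending good candidates while this distance keeps separating classes is
# the FSD idea; the extension loop itself is omitted here.
import numpy as np

def shapelet_distance(series, shapelet):
    m = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(series, m)
    return np.sqrt(((windows - shapelet) ** 2).sum(axis=1)).min()

t = np.sin(np.linspace(0, 6 * np.pi, 100))
s = t[10:18]                                  # a subsequence of t itself
print(shapelet_distance(t, s))                # -> 0.0, an exact match exists
```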
Short Text Similarity Measurement Using Context from Bag of Word Pairs and Word Co-occurrence

With the rapid development of social networks, short texts have become a prevalent form of social communication on the Internet. Measuring the similarity between short texts is fundamental to many applications, such as social network text querying, short text clustering and geographical event detection for smart cities. However, short texts in social media carry limited contextual information and are sparse, noisy and ambiguous, so effectively measuring the distance between them is a challenging task. In this paper, we propose a new heuristic word pair distance measurement (WPDM) technique for short texts, which exploits corpus-level word relations and enriches the context of each short text with a bag-of-word-pairs representation. We first adapt the Jaccard similarity to measure the distance between words. Words are then paired up to capture the latent semantics in a short text document, transforming the short text into a bag-of-word-pairs representation. The similarity between short text documents is finally calculated by averaging the distances of the word pairs. Experimental results on a real-world dataset demonstrate that the proposed WPDM is effective and achieves much better performance than state-of-the-art methods.

Shuiqiao Yang, Guangyan Huang, Bahadorreza Ofoghi
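
Two ingredients of WPDM can be sketched directly from the abstract: a Jaccard-style word similarity over document sets, and the bag-of-word-pairs representation. The exact weighting used in the paper may differ.

```python
# Hedged sketch of the two WPDM ingredients described above; the paper's
# precise distance and averaging scheme may differ.
from itertools import combinations

def jaccard_word_similarity(docs_with_a, docs_with_b):
    """Similarity of two words from the sets of documents containing them."""
    inter = len(docs_with_a & docs_with_b)
    union = len(docs_with_a | docs_with_b)
    return inter / union if union else 0.0

def bag_of_word_pairs(tokens):
    """All unordered word pairs of one short text, enriching its context."""
    return set(combinations(sorted(set(tokens)), 2))

print(jaccard_word_similarity({1, 2, 3}, {2, 3, 4}))   # -> 0.5
print(bag_of_word_pairs(["smart", "city", "event"]))
```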
Image Enhancement Method in Decompression Based on F-shift Transformation

To process a compressed image, such as a JPG image, a common approach is to first decompress the image to recover each pixel and then process it. In this paper, we propose a method for image enhancement during decompression. The image is compressed using the two-dimensional F-shift transform (TDFS) and the two-dimensional Haar wavelet transform. To enhance the image during decompression, we first adjust the brightness of the whole image by modifying the approximation coefficients and enhance the decompressed low-frequency component using contrast limited adaptive histogram equalization (CLAHE). Finally, we decompose the remaining data and perform the last step of image enhancement. Compared with CLAHE and the state-of-the-art method, our method not only combines the merits of spatial-domain and transform-domain methods but also reduces processing complexity and maintains the compressibility of the original image.

Ruiqin Fan, Xiaoyun Li, Huanyu Zhao, Haolan Zhang, Chaoyi Pang, Junhu Wang
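
Since the F-shift transform is described as similar to the Haar wavelet, the sketch below shows one level of the 2D Haar transform and the brightness adjustment on approximation coefficients; the F-shift specifics and the CLAHE step are omitted.

```python
# Hedged sketch: one level of the 2D Haar transform (the paper's F-shift
# transform is a related, quality-guaranteed variant). Brightening the image
# reduces to adding a constant to the approximation (LL) coefficients.
import numpy as np

def haar2d_level(img):
    """Split an even-sized image into LL, LH, HL, HH sub-bands."""
    a = (img[0::2, :] + img[1::2, :]) / 2      # row average
    d = (img[0::2, :] - img[1::2, :]) / 2      # row difference
    LL = (a[:, 0::2] + a[:, 1::2]) / 2
    LH = (a[:, 0::2] - a[:, 1::2]) / 2
    HL = (d[:, 0::2] + d[:, 1::2]) / 2
    HH = (d[:, 0::2] - d[:, 1::2]) / 2
    return LL, LH, HL, HH

img = np.arange(16, dtype=float).reshape(4, 4)
LL, LH, HL, HH = haar2d_level(img)
LL += 20   # global brightness adjustment applied to approximation coefficients
```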

Data Science of People and Health

Frontmatter
Playback Speech Detection Application Based on Cepstrum Feature

With the popularity of portable recording devices, playback speech has become one of the most important means of attack on speaker authentication systems. Compared with original speech, playback speech differs in the high-frequency band, and it also differs in the low-frequency band because of differences in recording equipment. Based on this finding, a detection algorithm is presented to extract representative features. In the high-frequency band, inverse-Mel filters (I-Mel) are used to extract speaker eigenvector sequences; in the low-frequency band, linear filters (Linear) are combined with Mel filters (Mel) to avoid superposition of characteristic parameters. The layers are fused into L-M-I filter banks that form new cepstral features. The experimental results show that the method detects playback speech effectively, with an equal error rate of 2.63%. Compared with traditional feature extraction methods (MFCC, CQCC, LFCC, IMFCC), the equal error rate decreases by 12.79%, 9.61%, 4.45% and 3.28%, respectively.

Jing Zhou, Ye Jiang
Study on Indoor Human Behavior Detection Based on WISP

With the development of wearable devices, there is growing interest in motion detection. The passive Wireless Identification and Sensing Platform (WISP), which transmits data in real time, is used here to detect indoor human behavior. WISP harvests energy from the ultra-high-frequency signals emitted by an RFID reader to power its built-in low-power microcontroller and sensors, saving energy costs, and uses backscatter to transmit data. The transmitted EPC data contain the acceleration information collected by its built-in three-axis accelerometer (ADXL362); the data are processed and the noise is removed by wavelet filtering in order to identify indoor human behavior. The experimental results verify the feasibility of using WISP to collect acceleration data and show that it can effectively detect various human behaviors.

Baocheng Wang, Zhijun Xie
A Novel Heat-Proof Clothing Design Algorithm Based on Heat Conduction Theory

To ensure the safety of workers engaged in high-temperature operations, it is of great significance to study the longest time the human body can resist heat at a given temperature. In this paper, we mainly use a grid-based iterative algorithm. On the basis of Fourier's law, we establish a partial differential equation relating clothing temperature, thickness and time, then discretize the computational domain into a grid and solve the resulting difference equation under the initial and boundary conditions. Taylor expansion with the classical explicit scheme is used to solve for the optimal thickness, time and other parameters. The results show that the clothing temperature rises rapidly in the initial stage and stabilizes after a critical time of about 1000 s, reaching the unsafe temperature of 47 °C at about 600 s. Compared with existing research, our model takes the temperature change into account when the human body is the heat source, and considers physical factors such as heat conduction, radiation and convection. The results are shown to be more reasonable.

Yuan Shen, Yunxiao Wang, Fangxin Wang, He Xu, Peng Li
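
The classical explicit scheme named in the abstract can be sketched for the 1D heat equation; the material constants and boundary temperatures below are hypothetical, not the paper's values.

```python
# Hedged sketch: the classical explicit (FTCS) finite-difference scheme for the
# 1D heat equation u_t = alpha * u_xx through a clothing layer. The constants
# and boundary temperatures are hypothetical placeholders.
import numpy as np

alpha, L, T = 1e-7, 0.01, 600.0          # diffusivity (m^2/s), thickness (m), time (s)
nx = 50
dx = L / nx
dt = 0.4 * dx * dx / alpha               # stability requires alpha*dt/dx^2 <= 1/2

u = np.full(nx + 1, 37.0)                # start at body temperature (deg C)
u[0] = 75.0                              # hot outer boundary
u[-1] = 37.0                             # skin side held by the body

r = alpha * dt / dx**2
for _ in range(int(T / dt)):
    u[1:-1] = u[1:-1] + r * (u[2:] - 2 * u[1:-1] + u[:-2])
print(f"skin-adjacent temperature after {T:.0f} s: {u[-2]:.1f} deg C")
```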
A Novel Video Emotion Recognition System in the Wild Using a Random Forest Classifier

Emotions are expressed by humans to demonstrate their feelings in daily life. Video emotion recognition can be employed to detect various human emotions captured in videos. Recently, many researchers have been attracted to this research area and have attempted to improve video emotion detection in both lab-controlled and unconstrained environments. While the recognition rate of existing methods is high on lab-controlled datasets, they achieve much lower accuracy in real-world uncontrolled environments, because of the variety of challenges present there, such as variations in illumination, head pose, and individual appearance. To address these challenges, we propose a framework that recognizes seven human emotions by extracting robust visual features from videos captured in the wild and handling head pose variation with a new feature extraction technique. First, sixty-eight face landmarks are extracted from the video sequences. Then, Generalized Procrustes Analysis (GPA) is employed to normalize the extracted features. Finally, a random forest classifier is applied to recognize emotions. We have evaluated the proposed method on the Acted Facial Expressions in the Wild (AFEW) dataset and obtained better accuracy than three existing video emotion recognition methods. Notably, the proposed system can be applied in various contextual applications such as smart homes, healthcare, the game industry and marketing in a smart city.

Najmeh Samadiani, Guangyan Huang, Wei Luo, Yanfeng Shu, Rui Wang, Tuba Kocaturk
Research on Personalized Learning Path Discovery Based on Differential Evolution Algorithm and Knowledge Graph

Discovering the most adaptive learning path and content to achieve learning goals is an urgent issue for today's e-learning systems. The main challenge in building such a system is to provide appropriate educational resources for different learners with different interests and background knowledge; the system should be efficient and adaptable. In addition, a learning path adapted to the learner can help reduce cognitive overload and disorientation. This paper proposes a framework for learning path discovery based on the differential evolution algorithm and a knowledge graph. In the first stage, learners are surveyed to form learner records according to their cognitive models, knowledge backgrounds, learning interests and abilities. In the second stage, a learner model database is generated based on the classification of learners' examination results. In the third stage, a knowledge graph based on disciplinary domain knowledge is established. The differential evolution algorithm is then introduced in the fourth stage and applied to learning path discovery over the disciplinary knowledge graph. The output of the system is a learning path adapted to the learner's needs, together with learning resource recommendations referring to that path.

Feng Wang, Lingling Zhang, Xingchen Chen, Ziming Wang, Xin Xu
Fall Detection Method Based on Wirelessly-Powered Sensing Platform

Falls are one of the common threats to the health of the elderly. Because falls have a high incidence and occur unpredictably, the rate at which the elderly receive timely assistance is low; an accurate and timely fall detection method can therefore help patients get effective assistance. We use a wirelessly-powered sensing platform that is easy to wear, converts RF signals into its own power without any battery, and works in non-line-of-sight environments, and we design a fall detection method based on it. First, the platform collects acceleration data from the human waist and obtains the motion acceleration and the corresponding Euler angle information. Then, combining the Discrete Wavelet Transform and the Hilbert-Huang Transform, an algorithm for decomposing acceleration signals is proposed to extract signal information. Finally, an anomaly detection algorithm for the Euler angle is proposed, and a Support Vector Machine is combined with it to detect fall behavior. To reduce power consumption, a sampling factor is set to change the sampling frequency dynamically. Experiments show that this method achieves high accuracy, over 94.7% even at the lowest sampling frequency, which is significant for assisting patients who fall.

Tao Zhang, Zhijun Xie
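
The final classification stage can be sketched as windowed acceleration-magnitude features fed to an SVM; the signals and labels below are synthetic, and the wavelet/Hilbert-Huang decomposition is omitted.

```python
# Hedged sketch of the classification stage: magnitude features from windowed
# tri-axial acceleration fed to an SVM. Signals and labels are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

def window_features(acc_xyz):
    """Peak and standard deviation of the acceleration magnitude in one window."""
    mag = np.linalg.norm(acc_xyz, axis=1)
    return [mag.max(), mag.std()]

X, y = [], []
for i in range(40):
    acc = rng.normal(0.0, 0.3, size=(50, 3))   # ~1 s of waist acceleration (toy)
    if i % 2:                                  # simulate a fall: a sharp spike
        acc[25] += [0.0, 0.0, 8.0]
    X.append(window_features(acc))
    y.append(i % 2)

clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```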
Service Evaluation of Elderly Care Station and Expectations with Big Data

The problem of population aging in China has attracted wide concern, and big data is an important technological method in aging services. Considering China's current national conditions, among the various forms of elderly care, home-based care will be the choice of most people, and the elderly care station is the terminal institution of the home-based elderly care service system. A service quality evaluation index system for elderly care stations is designed according to service and construction standards, and an evaluation model is established using the entropy method and the analytic hierarchy process (AHP). Analysis is performed on data from a questionnaire survey. The model can quantify service requirements, conduct objective evaluation, and play a positive role in monitoring the development of elderly care stations. By introducing big data technology into the evaluation of elderly services, the model will become more accurate and benefit the improvement of operational supervision and service quality.

Aihua Li, Diwen Wang, Meihong Zhu
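
The entropy weighting step can be sketched as follows: indicators whose scores vary more across stations receive larger weights. The score matrix is synthetic, and the AHP combination is omitted.

```python
# Hedged sketch of the entropy weighting step: indicators whose values vary
# more across stations carry more information and receive larger weights.
# The 4x3 score matrix (stations x indicators) is synthetic.
import numpy as np

scores = np.array([[0.8, 0.6, 0.9],
                   [0.7, 0.6, 0.4],
                   [0.9, 0.5, 0.6],
                   [0.6, 0.6, 0.8]])

p = scores / scores.sum(axis=0)                    # normalize each indicator column
entropy = -(p * np.log(p)).sum(axis=0) / np.log(len(scores))
weights = (1 - entropy) / (1 - entropy).sum()      # low entropy -> high weight
print(weights, scores @ weights)                   # weights and station scores
```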
EEG Pattern Recognition Based on Self-adjusting Dynamic Time Dependency Method

Biometric identification technology is applied extensively in modern society, and EEG pattern recognition is one of the key technologies for advanced, secure and reliable identification. This paper introduces a novel EEG pattern recognition method based on the Segmented EEG Graph using PLA (SEGPA) model, which incorporates a novel self-adjusting time series dependency method; the dynamic time-dependency method is applied in the recognition process. The preliminary experimental results indicate that the proposed method produces reasonable recognition outcomes.

Hao Lan Zhang, Yun Xue, Bailing Zhang, Xingsen Li, Xinzhe Lu

Web of Data

Frontmatter
Optimal Rating Prediction in Recommender Systems

Recommendation systems are the best choice for coping with the problem of information overload. These systems, commonly used in recent years, help match users with items. The increasing amount of data available on the Internet poses great challenges in the field of recommender systems; the main challenge is to predict user preferences and provide favorable recommendations. In this article, we present a new mechanism to improve the prediction accuracy of recommendations. Our method includes a discretization step and the chi-square algorithm for attribute selection. Results on the MovieLens dataset show that our technique performs well and minimizes the error ratio.

Bilal Ahmed, Li Wang, Waqar Hussain, M. Abdul Qadoos, Zheng Tingyi, Muhammad Amjad, Syed Badar-ud-Duja, Akbar Hussain, Muhammad Raheel
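
A minimal sketch of the discretize-then-select pipeline with scikit-learn; the data are synthetic rather than MovieLens, and the exact discretization used in the paper may differ.

```python
# Hedged sketch: discretize, then rank attributes by the chi-square statistic,
# echoing the paper's attribute-selection step. Data is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(7)
X = rng.random((200, 6))                         # 6 candidate user/item attributes
y = (X[:, 0] * 3 + X[:, 3]).round().astype(int)  # ratings driven by attrs 0 and 3

Xd = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform").fit_transform(X)
selector = SelectKBest(chi2, k=2).fit(Xd, y)
print(selector.get_support(indices=True))        # expected: attributes 0 and 3
```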
A Performance Comparison of Clustering Algorithms for Big Data on DataMPI

Clustering algorithms for big data have important applications in finance. DataMPI is a communication library based on key-value pairs that extends MPI for Hadoop and Spark. We experimentally study the performance of the K-means, fuzzy K-means and Canopy clustering algorithms on a DataMPI cluster. We first observe the influence of the number of nodes on clustering time and scaleup, then the influence of the memory size of each node on clustering time and memoryup, and finally compare the performance of the three clustering algorithms on different text data sets. The experimental results show that: (1) when the data set size, memory size, and number of nodes are kept the same, Canopy is the fastest, followed by K-means, and fuzzy K-means is the slowest; (2) when the memory size of each node is fixed, all three algorithms show good scaleup on all text data sets, which shows that increasing the number of nodes can significantly improve their efficiency; (3) when the number of nodes is fixed and the memory size is increased from 1 GB to 4 GB, the clustering time decreases significantly, which shows that all three algorithms have good memoryup.

Mo Hai
A Novel Way to Build Stock Market Sentiment Lexicon

The construction of domain-specific sentiment lexicons has become an important direction for improving the performance of sentiment analysis in recent years. As one of the important application areas of sentiment analysis, the stock market also has related research. However, while considering the heterogeneity of the stock market relative to other fields, these studies ignore its heterogeneity under different market conditions. At the same time, an annotated corpus is indispensable for these studies, but annotated corpora, especially social media corpora that are non-standardized, domain-specific and large in volume, are very difficult to obtain, and both manual and automatic labeling have limitations. Besides, the evaluation of stock market sentiment lexicons is still based on general classification criteria, ignoring the final purpose of sentiment analysis in the stock market: helping market participants make investment decisions, that is, achieving the highest profit. To address these problems, this paper proposes an unsupervised method of constructing a stock market sentiment lexicon based on the heterogeneity of the stock market, together with an evaluation method for such lexicons. We select four commonly used Chinese sentiment dictionaries as benchmark lexicons and verify the method on an unlabeled Eastmoney stock posting corpus containing 15,733,552 posts about 2,400 Chinese A-share listed companies. Under our lexicon evaluation framework based on portfolio annualized return, the stock market sentiment lexicon constructed in this paper achieves the best performance.

Yangcheng Liu, Fawaz E. Alsaadi
Decision Tree and Knowledge Graph Based on Grain Loss Prediction

China is an agricultural country, and agricultural production is an important part of the Chinese economic system. With the advent of the information age, plenty of data has been produced in the harvest and post-harvest chain, including harvesting, processing, transportation, and consumption. With proper use of these data, we can extract more and more valuable information. In this paper, a machine learning algorithm is adopted and improved to predict grain loss from data on the harvesting link. Machine learning is the core of artificial intelligence, and its applications cover all fields. Based on the traditional decision tree algorithm, a knowledge graph is used to make appropriate improvements for predicting post-harvest grain loss.

Lishan Zhao, Bingchan Li, Bo Mao
Research on Assessment and Comparison of the Forestry Open Government Data Quality Between China and the United States

The quality of Forestry Open Government Data (FOGD) is the basis for building a forestry open and shared service system. Based on the quality criteria for Open Government Data (OGD) and forestry data, this paper constructs a quality assessment framework for FOGD and adopts manual collection as well as web crawling to compare FOGD platforms in China and the United States. The comparison shows that Chinese FOGD has quality problems in security, openness, comprehensiveness, sustainability, availability, metadata, etc. To encourage users to participate in innovation and value creation by leveraging forestry government data extensively, it is recommended that policy standards readiness and data openness be improved, metadata standards be established, and comprehensive, accurate, consistent and standardized forestry government data be continuously opened up.

Bo Wang, Jiwen Wen, Jia Zheng
A Review on Technology, Management and Application of Data Fusion in the Background of Big Data

The purpose of data fusion is to combine multi-source and heterogeneous data to make the data more valuable. Re-examined in the context of big data, data fusion technology has undergone transformation and innovation; its management requires the support of new theories such as data governance, the big data chain, data sharing and security, and quality evaluation; and its application fields are broader. This paper reviews the technology, management and application of data fusion in the context of big data, and finally puts forward future prospects for big data fusion.

Siguang Chen, Aihua Li
GAN-Based Deep Matrix Factorization for Recommendation Systems

Recommendation systems can be divided into two categories: generator models that predict the relevant item given a user, and discriminator models that predict relevancy given a user-item pair. To combine the two models for better recommendation, we propose a novel deep matrix factorization model based on a generative adversarial network that uses collaborative graphs to relieve data sparsity. From interaction records, a user-collaboration graph and an item-collaboration graph are constructed. Then, the information of neighbor nodes in the collaborative graphs, both user-based and item-based, is used to alleviate the sparsity of the interaction matrix. Finally, the pre-filled matrix is fed into a deep generator and a deep discriminator, which learn the feature representations of users and items in a common low-dimensional space through adversarial training, generating better top-N recommendation results. We conduct extensive experiments on two real-world datasets to demonstrate the effectiveness of our model.

Qingqin Wang, Yun Xiong, Yangyong Zhu
The Feature of the B&R Exchange Rate: Comparison with Main Currency Based on EMD Algorithm and Grey Relational Degrees

According to the BELT AND ROAD PORTAL, as of April 2019 China had signed 173 cooperation documents with 125 countries and 29 international organizations along the Belt and Road (B&R), and the exchange rate volatility of B&R currencies is usually higher. In this paper, we first construct a B&R exchange rate index on the basis of both trade volumes and foreign investment, and compare it with the RMB exchange rate index. The EMD algorithm is used to decompose each index into 8 IMFs and a residual signal. Based on grey comprehensive relational degrees, we then reconstruct the market fluctuation term and noise term of each exchange rate index, with the residual signal as the trend term. Through comparative analysis we discuss and explain the features and relationship of RMB exchange rate risk along the B&R and against main currencies, and find that in the long term there is more devaluation risk in the currencies of the countries and regions along the B&R; in the middle term there is a lead-lag relationship between the two indexes, and the B&R exchange rate index is set to decline; and in the short term the exchange risk in the countries and regions along the B&R is greater than the worldwide RMB exchange rate risk. Finally, we put forward several suggestions to help "going-out" enterprises avoid and manage exchange rate risks effectively.

Cui Yixi, Liu Zixin, Li Ziran, Guo Kun
Dynamic Clustering of Stream Short Documents Using Evolutionary Word Relation Network

The explosive growth of Web 2.0 applications (e.g., social networks, question answering forums and blogs) leads to the continuous generation of short texts. Clustering analysis for automatically categorizing stream short texts has proved to be one of the critical unsupervised learning techniques. However, the unique attributes of short texts (e.g., few meaningful keywords, noisy features and lack of context) and the temporal dynamics of the stream make this task challenging. To tackle the problem, we propose a stream clustering algorithm, EWNStream, that explores an Evolutionary Word relation Network. The word relation network is constructed from the aggregated word co-occurrence patterns of a batch of short texts in the stream, to overcome the sparse features of short texts at the document level. To cope with the temporal dynamics of the stream, the word relation network is incrementally updated with newly arriving batches of data; the change of the word relation network indicates the evolution of the underlying clusters. Based on the evolutionary word relation network, we propose a keyword group discovery strategy to extract representative terms for the underlying short text clusters, and the keyword groups are used as cluster centers to group the stream short texts. The experimental results on a real-world Twitter dataset show that our method achieves much better clustering accuracy and time efficiency.

Shuiqiao Yang, Guangyan Huang, Xiangmin Zhou, Yang Xiang
Data Exchange Engine for Parallel Computing and Its Application to 3D Chromosome Modelling

The Data Exchange Engine for Parallel Computing (abbreviated DEEPC) is a universal parallel programming interface for scientific computing environments such as MATLAB, Octave, R and Python. It is software we developed to support Bulk Synchronous Parallel (BSP) computing in these mainstream script-driven scientific computing environments. BSP is one of the most dominant parallel programming models and profoundly affects the design of parallel algorithms; however, most of these scientific computing environments lacked software support for BSP until the birth of DEEPC. The main features of DEEPC are its ease of use and high performance: without much modification to sequential-computing programs, one can combine them into a high-performance parallel program with a short script. To demonstrate these features, we applied DEEPC to a MATLAB program for the 3D modelling of chromosomes, and observed that DEEPC performs very well even without much modification to the corresponding sequential program. CCS Concepts: Computing methodologies → Parallel programming.

Xiaoling Zhang, Yao Lu, Junfeng Wu, Yongrui Zhang
Image Quick Search Based on F-shift Transformation

Determining which part of a larger image contains a given image is a topic of practical significance in computer vision, image and video processing. Images are commonly compressed for efficient storage and transfer, so to search for a given image among them we normally need to decompress them first and then process the task. In this paper, we give a quick image search method that works directly on F-shift compressed images, meaning no decompression is needed. The basic principle lies in a property of the F-shift transformation (similar to the Haar wavelet transformation), in which each datum is quality-guaranteed. This property ensures that we can search only the high-frequency components of a compressed image to reach our goal. Benefiting from the fact that no decompression is needed, the efficiency of our method is improved significantly.

Tongliang Li, Ruiqin Fan, Xiaoyun Li, Huanyu Zhao, Chaoyi Pang, Junhu Wang

Data Science of Trust

Frontmatter
Functional Dependency Discovery on Distributed Database: Sampling Verification Framework

In relational databases, functional dependency discovery is a very important database analysis technology, with a wide range of applications in knowledge discovery, database semantic analysis, data quality assessment and database design. Existing functional dependency discovery algorithms are mainly designed for centralized data and are usually applicable only when the data size is small. As database scales grow rapidly, functional dependency discovery in distributed environments has increasingly important practical significance. A functional dependency discovery algorithm for big data in a distributed environment is proposed. The basic idea is to first discover functional dependencies on a sampled data set and then globally verify those that may hold globally, so that all functional dependencies can be discovered. Parallel computing can be used to improve discovery efficiency while ensuring correctness.

Chenxin Gu, Jie Cao
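
The verification primitive behind the sample-then-verify strategy can be sketched in pandas: a functional dependency X → Y holds iff every X-group takes a single Y value.

```python
# Hedged sketch of the verification primitive: a functional dependency X -> Y
# holds on a relation iff every X-group has exactly one Y value. Sampling
# first and verifying survivors globally is the paper's overall strategy.
import pandas as pd

def fd_holds(df, lhs, rhs):
    """Check whether the FD lhs -> rhs holds on this (partition of the) table."""
    return (df.groupby(list(lhs))[rhs].nunique() <= 1).all()

df = pd.DataFrame({"zip": [100, 100, 200],
                   "city": ["A", "A", "B"],
                   "street": ["x", "y", "z"]})
print(fd_holds(df, ["zip"], "city"))    # True: zip -> city
print(fd_holds(df, ["zip"], "street"))  # False: zip 100 maps to two streets
```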
Bankruptcy Forecasting for Small and Medium-Sized Enterprises Using Cash Flow Data

Credit rating has long been a topic of interest in academic research, and there are many studies of credit rating methods for large and listed companies. However, due to the lack of financial data and information asymmetry, developing credit ratings for small and medium-sized enterprises (SMEs) is difficult. To alleviate this problem, this paper adopts a novel approach, using SMEs' cash flow data to make bankruptcy predictions and improving the accuracy of bankruptcy prediction for SMEs through feature extraction from cash flow data. We validate the prediction performance after adding features extracted from cash flow data on six supervised learning algorithms. The results show that using cash flow data can improve the performance of bankruptcy prediction for SMEs.

Yong Xu, Gang Kou, Yi Peng, Fawaz E. Alsaadi
Pairs Trading Based on Risk Hedging: An Empirical Study of the Gold Spot and Futures Trading in China

This paper builds a quantitative investment strategy based on pairs trading, combined with the support vector machine model from machine learning and supplemented with technical indicators (RSI, SMA), to help the manufacturing and agricultural production sectors hedge the risk of price fluctuations. An empirical study is conducted using the gold spot market as an example. We find that the proposed quantitative strategy reflects well the ability of financial markets to help the real economy reduce risks, and is significantly effective in the Chinese market.

Shuze Guo, Wen Long
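
The two supplementary indicators named in the abstract can be computed in a few lines of pandas; the price series below is synthetic rather than gold spot data, and the parameter windows are conventional defaults.

```python
# Hedged sketch of the two supplementary indicators (SMA, RSI); the price
# series is synthetic and the window lengths are conventional choices.
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
price = pd.Series(300 + rng.normal(0, 1, 200).cumsum())

sma = price.rolling(20).mean()                     # 20-period simple moving average

delta = price.diff()
gain = delta.clip(lower=0).rolling(14).mean()      # average gain over 14 periods
loss = (-delta.clip(upper=0)).rolling(14).mean()   # average loss over 14 periods
rsi = 100 - 100 / (1 + gain / loss)

print(sma.iloc[-1], rsi.iloc[-1])
```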
A Rectified Linear Unit Model for Diagnosing VCSEL’s Power Output

Vertical cavity surface emitting lasers (VCSELs) are broadly applied in optical communication, optical interconnection, optical information processing, and optical integrated systems, so diagnosing the output power of a VCSEL is of great importance from the application point of view. Traditional approaches diagnose the output power via the rate equation, which is easily disturbed by zero-value samples. Such a model can capture the relationship between the laser output power intensity and the device temperature; however, these methods may overfit and fall into local optima during fitting. In this paper, we propose an advanced model to address these limitations. Specifically, our model adds a Rectified Linear Unit (ReLU) and weight parameters to reduce the zero-value interference. Moreover, the adaptive moment estimation (Adam) algorithm is employed to learn the model parameters, and the L2-norm is used to prevent overfitting. The experimental results show that the proposed model outperforms the base model significantly and can be used for diagnosing VCSEL power output. The mean squared error (MSE) of our model is 0.0815, and the mean absolute percentage error (MAPE) is 20.72%, which is 22.29% lower than the base model.

Li Wang, Wenhao Chen
Blockchain Based High Performance User Authentication in Electric Information Management System

User authentication is essential in electric information management systems, where operations should be controlled according to standard specifications. Traditionally, authentication is implemented by rule/rights definitions in a static database, which is difficult to manage and audit when there are many users and resources. Blockchain technology is therefore introduced to facilitate user authentication, given its decentralization and security features. However, the performance of blockchain-based authentication is relatively low compared with a centralized database. In this paper, we propose a high-performance user authentication framework for electric information management systems based on a multi-level structure and parallel computing. The proposed framework is implemented in a provincial electric company of the China State Grid. According to the experimental results, the proposed framework not only implements blockchain-based user authentication but also improves system performance by 60% compared with the non-optimized system.

Cong Hu, Ping Wang, Chang Xu
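
To see why a blockchain-style log is attractive for auditing authentication, the sketch below hash-chains authentication events so that tampering with history invalidates all later records; it illustrates the general idea only, not the paper's system.

```python
# Hedged sketch: a hash-chained authentication log. Each event commits to the
# previous one, so rewriting history breaks verification of all later records.
# This is a generic illustration, not the paper's framework.
import hashlib, json

def append_event(chain, user, resource, action):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"user": user, "resource": resource, "action": action, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})

def verify(chain):
    for i, rec in enumerate(chain):
        body = {k: rec[k] for k in ("user", "resource", "action", "prev")}
        if rec["hash"] != hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest():
            return False
        if i and rec["prev"] != chain[i - 1]["hash"]:
            return False
    return True

log = []
append_event(log, "op01", "meter-db", "read")
append_event(log, "op02", "meter-db", "write")
print(verify(log))          # True
log[0]["action"] = "write"  # tamper with history
print(verify(log))          # False
```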
A Blockchain Based Secure Data Transmission Mechanism for Electric Company

Data security is an important requirement for an electric company, which takes responsibility for the energy supply of our daily lives. In this paper, a blockchain-based framework is proposed to guarantee the integrity and authenticity of data transmission in an electric company. We employ the blockchain to implement a decentralized system that records every data transformation operation in the network, and a double-account strategy is designed to achieve anonymous data transmission. We implemented the proposed method in a test network with five nodes, and the experimental results indicate that it effectively improves data security in the electric company.

Ping Wang, Cong Hu, Min Xu
Design and Implementation of a Blockchain Based Authentication Framework: A Case Study in the State Grid of China

Blockchain has proved to be a promising technology for decentralized, secure data transmission and management, and it can be applied to user authentication frameworks, especially in large companies such as the State Grid of China. In this paper we demonstrate the implementation of a blockchain based system for user authentication in an electric information management system that covers many business and application aspects. We first study the security improvement that blockchain technology brings to user authentication, and suggest a three-step consensus mechanism based on the Kafka sort function. Then the implementation of the proposed framework is described and its features are analyzed. According to the experimental results, the proposed authentication framework is efficient and effective in improving the overall information security performance of an electric company.
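
The "Kafka sort function" suggests a deterministic ordering step before blocks are cut, so every peer applies authentication requests in the same order. A hypothetical sketch of such an ordering step; the field names and batch size are our assumptions, not the paper's:

def cut_block(pending, max_batch=100):
    # deterministic total order: (timestamp, client id, sequence number)
    ordered = sorted(pending, key=lambda m: (m["ts"], m["client"], m["seq"]))
    return ordered[:max_batch], ordered[max_batch:]

pending = [{"ts": 2, "client": "B", "seq": 1}, {"ts": 1, "client": "A", "seq": 7}]
block, remaining = cut_block(pending)   # the block is identical on every orderer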

Cong Hu, Chang Xu, Ping Wang
Evolutionary Mechanism of Risk Factor Disclosure in American Financial Corporation Annual Report

Since 2005, most U.S. listed firms have been mandated by the Securities and Exchange Commission (SEC) to disclose an additional section, Item 1A risk factors, in their annual reports (i.e., Form 10-K filings) to discuss the risk factors they are facing. Existing research rarely studies the evolutionary mechanism of risk factor disclosure. Based on 263,310 risk factors extracted from 9,730 U.S. financial firm annual reports during 2006–2016, this paper traces the trends of risk disclosure in terms of risk factor number, redundancy, specificity, fog index, sentiment subjectivity, and boilerplate. The empirical analysis shows that the overall trends of these six textual attributes are rising. This paper further studies the evolutionary mechanism from the perspectives of firm characteristics, regulation, and the financial crisis. There are two main findings. First, the overall trends of these textual attributes can be explained by changes in firm characteristics. Second, the occurrence of important events, such as the release of regulations and the financial crisis, can lead to abrupt jumps in the textual attributes.
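
Of the six attributes, the fog index is the most formulaic: 0.4 × (words per sentence + 100 × share of complex words). A quick sketch with a crude syllable heuristic (illustrative only; the paper's exact text preprocessing is not specified here):

import re

def syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / sentences + 100 * len(complex_words) / max(1, len(words)))

print(fog_index("Adverse regulatory developments may materially affect our liquidity."))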

Guowen Li, Jianping Li, Mingxi Liu, Xiaoqian Zhu
Impact of Dimension and Sample Size on the Performance of Imputation Methods

Real-world data collections often contain missing values, which can cause serious problems for data analysis. Simply discarding records with missing values tends to create bias in the analysis. Missing data imputation methods instead fill in the missing values with estimated values. While numerous imputation methods have been proposed, they are mostly judged by their imputation accuracy, and little attention has been paid to their efficiency. With the increasing size of data collections, imputation efficiency becomes an important issue. In this work we conduct an experimental comparison of several popular imputation methods, focusing on their time efficiency and scalability in terms of sample size and record dimension (number of attributes). We believe these results can guide data analysts in choosing imputation methods.
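
The kind of comparison described can be reproduced in miniature: time a cheap method against an expensive one while the sample size grows. A minimal harness (the sizes, missing rate, and chosen methods are arbitrary, not the paper's setup):

import time
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
for n in (1_000, 5_000):
    X = rng.normal(size=(n, 20))
    X[rng.random(X.shape) < 0.1] = np.nan            # 10% missing completely at random
    for imp in (SimpleImputer(strategy="mean"), KNNImputer(n_neighbors=5)):
        t0 = time.perf_counter()
        imp.fit_transform(X)
        print(type(imp).__name__, n, round(time.perf_counter() - t0, 3), "s")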

Yanjun Cui, Junhu Wang
Diagnosing and Classifying the Fault of Transformer with Deep Belief Network

Transformers are important equipment in the smart grid, and transformer faults have a great impact on its safe and stable operation, which makes transformer fault diagnosis and classification particularly critical. This paper first introduces the application of the restricted Boltzmann machine and the deep belief network to transformer fault diagnosis and classification, then designs a diagnosis and classification model based on the rectified linear unit and the deep belief network for the large number of transformers in a smart grid, and describes in detail the selection of feature parameters, the partition of fault patterns, the analysis of sample data, and the setting of model parameters. Finally, the efficiency and accuracy of the proposed model are tested and compared with SVM and BPNN using actual transformer fault data collected in daily operation. The case study shows that the proposed model can effectively achieve transformer fault diagnosis and classification and provides a valuable method for this task.
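
scikit-learn has no full deep belief network, but the layered idea can be approximated by stacking an RBM feature extractor under a classifier. A synthetic stand-in, assuming dissolved-gas-style features scaled to [0, 1] (the data and label rule below are toys, not the paper's):

import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
X = rng.random((600, 5))                   # e.g. H2, CH4, C2H6, C2H4, C2H2 ratios
y = (X[:, 3] + X[:, 4] > 1.0).astype(int)  # toy "high-energy discharge" label

model = Pipeline([("rbm", BernoulliRBM(n_components=16, learning_rate=0.05,
                                       n_iter=30, random_state=1)),
                  ("clf", LogisticRegression(max_iter=500))])
model.fit(X[:500], y[:500])
print("holdout accuracy:", model.score(X[500:], y[500:]))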

Lipeng Zhu, Wei Rao, Junfeng Qiao, Sen Pan

Internet of Things

Frontmatter
Extraction Method of Traceability Target Track in Grain Depot Based on Target Detection and Recognition

Food security is related to the national economy and people's livelihood, and secure grain storage is the key to achieving it; relevant departments therefore attach great importance to the traceability of the warehousing link. In this paper, we use the R-CNN algorithm to extract traceability-related target information from monitoring videos in the warehouse, then match targets in adjacent frames based on class, location, and image features, and finally connect the same target across adjacent frames to obtain its running track in the current monitoring scene. The experimental results show that the R-CNN based detection and recognition reaches a high level of recognition accuracy and detection rate, meeting the needs of real-time analysis. At the same time, the multi-factor target matching fusion algorithm balances matching accuracy and matching efficiency, making it a comparatively good target matching method. Trajectory data extracted in this way directly reflects the running route of each target, and such data is also easier for the public to accept as evidence and basis for traceability.
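
The adjacent-frame matching step can be sketched as greedy association by class and bounding-box overlap (IoU); the appearance-feature similarity the paper also fuses in is omitted here for brevity:

def iou(a, b):                           # boxes as (x1, y1, x2, y2)
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match(prev, curr, thr=0.3):          # detections: {"cls": ..., "box": ...}
    pairs, used = [], set()
    for i, p in enumerate(prev):
        best = max(((iou(p["box"], c["box"]), j) for j, c in enumerate(curr)
                    if c["cls"] == p["cls"] and j not in used), default=(0, None))
        if best[0] >= thr:
            pairs.append((i, best[1])); used.add(best[1])
    return pairs                          # chaining pairs frame-to-frame yields tracks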

Jianshu Zhang, Bingchan Li, Jiayue Sun, Bo Mao, Ali Lu
CRAC: An Automatic Assistant Compiler of Checkpoint/Restart for OpenCL Program

Nowadays, people use multiple devices to meet a growing demand for computing. With the spread of multi-card computing, fault tolerance, load balancing, and resource sharing have become hot issues, and the checkpoint/restart (CR) mechanism is critical in a preemptive system. This paper proposes a checkpoint/restart framework, including an automatic assistant compiler (CRAC), to achieve a feasible checkpoint/restart system, especially for GPU applications on heterogeneous devices in OpenCL programs. Given the positions of the checkpoints and restarts in the source code, CRAC inserts primitives into the program and invokes the runtime support modules to produce the final results. A comprehensive example and experiments demonstrate the feasibility and effectiveness of the proposed framework.
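
CRAC itself rewrites OpenCL host/kernel code, but the primitive it automates is the familiar save-and-resume pattern. A language-level analogy only (this is not CRAC's API):

import os, pickle

CKPT = "state.pkl"
start, acc = pickle.load(open(CKPT, "rb")) if os.path.exists(CKPT) else (0, 0)

for i in range(start, 1_000_000):
    acc += i * i                          # stand-in for kernel work
    if i % 100_000 == 0:                  # checkpoint position chosen by the user
        with open(CKPT, "wb") as f:
            pickle.dump((i + 1, acc), f)  # persist loop state; a restart resumes here
print(acc)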

Genlang Chen, Jiajian Zhang, Zufang Zhu, Chaoyan Zhu, Hai Jiang, Chaoyi Pang
Design of Wireless Data Center Network Structure Based on ExCCC-DCN

A data center is a cluster of servers, typically an organic collection of tens of thousands of them. With such a large number of servers, performance depends heavily on how the servers are interconnected: just as the topology of an Ethernet determines its network capacity and communication characteristics, the network structure of a data center has a great impact on its performance and capacity. The research object of this paper, ExCCC-DCN, has excellent node capacity, and its construction cost and operating energy consumption are relatively low, which makes it very suitable for deploying large-scale data centers. However, ExCCC-DCN is deployed over wired links, and a large number of wired links makes maintenance and upgrades of the data center more difficult. Therefore, this paper exploits the high transmission rate, strong anti-interference, and high security of 60 GHz millimeter wave to transform ExCCC-DCN into a wireless design, making it more flexible while maintaining communication efficiency.

Yanhao Jing, Zhijie Han, Xiaoyu Du, Qingfang Zhang
A New Type Wireless Data Center of Comb Topology

In recent years, group communication has become more and more popular, and in current data center networks multicast plays an extremely important role in it. Traditional data center networks are designed around wires; researchers have successively proposed DCell, BCube, and other data center topologies. However, in traditional data center networks the wired topology suffers from high overhead, complex deployment, high maintenance cost, slow heat dissipation, and so on. Moreover, the existing wireless data center structures Flyways and Graphite have low connectivity and poor scalability. This paper therefore proposes and designs a new wireless data center network topology, the Comb structure: a multi-layer cellular model composed of single-layer honeycomb-like structures interlaced with each other. The structure can effectively improve the connectivity of the data center, shorten routing paths, reduce the probability of node failure in the network, and keep the network running smoothly.
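
The single-layer building block is a honeycomb lattice, whose connectivity and path lengths are easy to inspect; networkx's generator stands in for one Comb layer here (the multi-layer interlacing that defines Comb is not modeled):

import networkx as nx

G = nx.hexagonal_lattice_graph(4, 4)     # one honeycomb-like layer
print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("avg shortest path:", round(nx.average_shortest_path_length(G), 2))
print("node connectivity:", nx.node_connectivity(G))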

Qingfang Zhang, Zhijie Han, Xiaoyu Du
Toward PTNET Network Topology Analysis and Routing Algorithm Design

In recent years, data center networks have been used for the transmission, storage, and processing of big data, playing an important role in cloud computing and CDN distribution applications. Network topology and routing algorithms are their core research content and key technical issues. Traditional network topologies struggle to guarantee quality of service in terms of scalability and fault tolerance, whereas server-centric data center network topologies can grow the network by recursively increasing the number of nodes and links. Compared with typical topologies such as DCell, BCube, and BCCC, PTNet, a representative of the new server-centric data center network topologies, has advantages in scalability, fault tolerance, and so on. Multicast and broadcast have many application scenarios and much practical value in data center networks; for example, online video conferencing, multimedia distance education, and other developments are inseparable from their application and promotion. It is therefore necessary to study multicast and broadcast routing algorithms in such networks. Based on an in-depth study of PTNet, this paper analyzes its network topology and its multicast and broadcast routing algorithms.
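
The broadcast side of such routing is often built on a spanning tree rooted at the source, so each packet crosses every link at most once. A generic sketch on a placeholder graph (PTNet's recursive structure would supply the real topology):

import networkx as nx

G = nx.random_regular_graph(4, 32, seed=0)   # placeholder topology
tree = nx.bfs_tree(G, source=0)              # broadcast spanning tree
print("broadcast hops (tree depth):",
      max(nx.shortest_path_length(tree, 0).values()))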

Zhijie Han, Qingfang Zhang, Xiaoyu Du, Kun Guo, Mingshu He
AODV Protocol Improvement Based on Path Load and Stability

A mobile ad hoc network (MANET) is one of the most important and distinctive wireless networks, independent of any network infrastructure. The whole network is mobile, individual nodes are allowed to move freely, and the topology changes dynamically with their movement. Node communication depends on the routing protocol for planning and confirmation. The AODV protocol, for example, selects the shortest path as its single route-selection metric, which easily results in unbalanced network load and path instability, degrading the performance of all aspects of the network. Selecting the route metric is the most sensitive challenge MANETs face: many protocols use hop count as the single metric and ignore network load balancing and path stability, yet an overloaded network becomes congested, and unstable paths easily cause link disruptions that affect overall performance. Based on the AODV protocol, this paper proposes a new metric that takes energy, velocity, and hop count fully into consideration in order to find a path with both high residual energy and high stability. The simulation results show that the proposed method obtains better experimental results than the original AODV.
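
The composite metric can be sketched as a weighted score over a candidate path's bottleneck energy, worst node velocity, and hop count; the weights below are illustrative, not the paper's calibration:

def route_score(path, w_energy=0.5, w_speed=0.3, w_hops=0.2):
    energy = min(n["energy"] for n in path)   # bottleneck residual energy
    speed = max(n["speed"] for n in path)     # fastest (least stable) node
    return w_energy * energy - w_speed * speed - w_hops * len(path)

routes = [[{"energy": 0.9, "speed": 1.0}, {"energy": 0.7, "speed": 2.0}],
          [{"energy": 0.8, "speed": 0.5}]]
best = max(routes, key=route_score)           # replaces the plain hop-count rule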

Yonghang Yan, Aofeng Shang, Bingheng Chen, Zhijie Han, Qingfang Zhang
Complex Real-Time Network Topology Generation Optimization Based on Message Flow Control

Some complex systems, such as in-vehicle systems and avionics systems, place high requirements on real-time performance. Large-scale message interaction within these systems constitutes a complex interaction network, and the topology of this network has a great impact on real-time performance, since different topologies can cause dramatic differences in message transmission delays. Community discovery and topological grouping are the main methods for network topology generation, but they cannot directly guarantee real-time performance. This paper proposes a complex real-time network topology generation algorithm based on message flow control, and compares its real-time performance with a manually designed network topology based on a balanced strategy. Considering that the flow-control mechanism is the main factor influencing real-time performance, the frame length and bandwidth allocation gap (BAG) of each message are used as influence factors in the topology construction process, and the nodes in the network are clustered according to the tightness of their communication to ensure real-time performance. Analytic methods are used to verify the real-time performance of the generated topology. In the detailed comparison, the queuing strategy of messages in the nodes is divided into two cases: first-in-first-out (FIFO) and static priority (SP). The results show that, for both queuing strategies, the real-time performance of almost 74% of the message flows in the topology generated by the flow-control based algorithm is better than in the manually designed topology.
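
The two flow-control quantities are concrete: a flow's guaranteed bandwidth is its frame length divided by its BAG, and the heaviest-communicating node pairs are grouped first. A toy sketch of that ordering step only (the paper's actual clustering algorithm is more involved):

flows = [{"src": "A", "dst": "B", "frame_bytes": 1500, "bag_ms": 2},
         {"src": "A", "dst": "C", "frame_bytes": 500, "bag_ms": 8}]

def bandwidth(f):                     # bytes per millisecond = frame length / BAG
    return f["frame_bytes"] / f["bag_ms"]

for f in sorted(flows, key=bandwidth, reverse=True):
    print(f["src"], "->", f["dst"], bandwidth(f), "B/ms")  # cluster tight pairs first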

Feng He, Zhiyu Wang, Xiaoyan Gu
Multi-core Processor Performance Evaluation Model Based on DPDK Affinity Setting

A general multi-core processor hardware platform combined with DPDK affinity binding technology can effectively reduce the performance overhead caused by CPU interrupts and inter-thread scheduling, achieving performance equal to that of dedicated network processors. Our focus is how to construct a model to analyze such processing systems. This paper combines the DPDK affinity feature on a general multi-core processor platform, establishes a fixed binding relationship between processing threads and processor logical cores, analyzes the distribution features of the multi-core processor nodes once the binding relationship is determined, and builds a queuing model. Finally, the model is analyzed and performance indicators are derived. The model provides an analytical framework for general-purpose multi-core processor platforms with affinity settings, and expressions for key indicators are provided for further research.
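
Once each thread is pinned to one logical core, every core can be treated as an independent single-server queue. A textbook M/M/1 computation stands in here for the paper's model (λ: packet arrival rate, μ: service rate; the numbers are hypothetical):

def mm1(lam, mu):
    assert lam < mu, "queue is unstable"
    rho = lam / mu                    # core utilization
    W = 1 / (mu - lam)                # mean time in system
    Wq = rho / (mu - lam)             # mean waiting time in queue
    L = rho / (1 - rho)               # mean number of packets in system
    return rho, W, Wq, L

print(mm1(lam=8e5, mu=1e6))           # e.g. 0.8 Mpps offered to a 1 Mpps core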

Canshuai Wang, Wenjun Zhu, Haocheng Zhou, Zhuang Xu, Peng Li
Parallel Absorbing Diagonal Algorithm: A Scalable Iterative Parallel Fast Eigen-Solver for Symmetric Matrices

In this paper, a scalable parallel eigen-solver called the parallel absorbing diagonal algorithm (parallel ADA) is proposed. This algorithm has significantly improved parallel complexity compared with traditional parallel symmetric eigen-solvers, whose scalability bottleneck is the tri-diagonalization of a matrix via Householder/Givens transforms. The basic idea of ADA is to avoid tri-diagonalization entirely by iteratively and alternately applying two kinds of operations at multiple scales: diagonal attraction operations and diagonal absorption operations. A diagonal attraction operation attracts the off-diagonal entries, making the entries near the diagonal larger in magnitude than those far away from it; a diagonal absorption operation absorbs the nearer nonzero entries into the diagonal. The theory of ADA, established in another paper of ours, shows that for any ε > 0 there exists a constant C = C(ε) such that within C rounds of iterations the relative error of the algorithm is reduced below ε. The parallel complexity of ADA is analyzed in this paper to reveal its qualitative improvement in scalability.

Junfeng Wu, Hui Zheng, Peng Li
Model Fusion Based Oilfield Production Prediction

Oil production prediction is a main focus of scientific management. During oil exploitation, the production data can be treated as a time series that is affected by production plans and geologic conditions, which makes the series complex. To address this, this paper makes full use of the advantages of different prediction models and proposes a model fusion based approach (called TN-Fusion) for production prediction. The approach effectively extracts the temporal and non-temporal features affecting production and improves prediction accuracy through the effective fusion of a time series model and a non-time-series model. Compared with single-model approaches and non-time-series model fusion methods, TN-Fusion achieves better accuracy and reliability.
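
The fusion idea can be sketched as weighting a temporal predictor and a non-temporal regressor by their validation errors; everything below (the smoother, the features, the weights) is our own illustration, not TN-Fusion itself:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def exp_smooth_forecast(y, alpha=0.3):    # simple temporal model
    level = y[0]
    for v in y[1:]:
        level = alpha * v + (1 - alpha) * level
    return level                          # one-step-ahead forecast

rng = np.random.default_rng(2)
y = 100 + np.cumsum(rng.normal(0, 1, 200))        # toy production series
X = rng.random((200, 3))                          # toy non-temporal covariates

ts_pred = exp_smooth_forecast(y[:-1])
nt_pred = GradientBoostingRegressor().fit(X[:-1], y[:-1]).predict(X[-1:])[0]

e_ts, e_nt = 1.0, 2.0                             # assumed validation errors
w_ts = (1 / e_ts) / (1 / e_ts + 1 / e_nt)         # inverse-error weighting
fused = w_ts * ts_pred + (1 - w_ts) * nt_pred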

Xingjie Zeng, Weishan Zhang, Long Xu, Xinzhe Wang, Jiangru Yuan, Jiehan Zhou
Coverage Path Planning of Penaeus vannamei Feeding Based on Global and Multiple Local Areas

Penaeus vannamei (whiteleg shrimp) has high aquaculture economic benefits, and high-frequency feeding is crucial for the rapid growth of shrimp during the breeding process. To support high-frequency feeding, an unmanned surface vehicle (USV) can greatly improve feeding efficiency and precision while reducing the labor intensity of the personnel. Path planning for the USV is a prerequisite for improving feeding efficiency. Based on the actual growth process of shrimp, a two-stage feeding path planning strategy is proposed. In the seedling stage, the shrimps' range of activity is small and the whole aquaculture area must be covered uniformly, so an inside-out and outside-in spiral traversal coverage path planning method with higher coverage is proposed. In the mature stage, shrimp aggregate in local areas because of their larger activity space, and these aggregation areas must be covered and fed; we therefore propose a path planning strategy combining global coverage with coverage of multiple local areas, and an improved simulated annealing genetic algorithm is adopted to solve the global path planning. Together, the two strategies achieve path planning over the whole growth cycle of Penaeus vannamei, improving the feeding efficiency of the bait and reducing the cost of breeding.
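
The seedling-stage pattern reduces to generating inward spiral waypoints over the pond. A minimal sketch for a rectangular pond of W × H meters with feeding swath d (the mature-stage multi-area planning and the annealing-genetic solver are not shown):

def spiral_waypoints(W, H, d):
    x0, y0, x1, y1, pts = 0.0, 0.0, W, H, []
    while x1 - x0 > d and y1 - y0 > d:
        pts += [(x0, y0), (x1, y0), (x1, y1), (x0, y1), (x0, y0 + d)]
        x0, y0, x1, y1 = x0 + d, y0 + d, x1 - d, y1 - d
    return pts

print(spiral_waypoints(30, 20, 4)[:6])    # first few corners of the inward spiral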

XueLiang Hu, Zuan Lin
A Novel Throughput Based Temporal Violation Handling Strategy for Instance-Intensive Cloud Business Workflows

Temporal violations occur during the batch-mode execution of instance-intensive business workflows running in cloud environments, and they may significantly affect the QoS (quality of service) of a cloud workflow system. However, most current research on workflow temporal QoS focuses on single scientific workflows rather than business workflows with a batch of parallel workflow instances, so handling temporal violations of instance-intensive cloud business workflows is a new challenge. To address this problem, we propose a novel throughput based temporal violation handling strategy. Specifically, we first define a throughput based temporal violation handling point to determine where handling should be conducted, and second we design a new method for adding the cloud computing resources necessary to recover from detected violations. Experimental results show that our strategy can effectively handle temporal violations in cloud business workflows and thus guarantee a satisfactory on-time completion rate.
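
The strategy's two decisions can be sketched directly: detect a handling point when achieved throughput falls below the rate needed to finish on time, then size the extra resources for the remaining instances. All parameters below are illustrative, not the paper's:

import math

def needs_handling(done, total, elapsed, deadline):
    required = total / deadline                  # instances per minute needed overall
    achieved = done / max(elapsed, 1e-9)
    return achieved < required                   # throughput based handling point

def extra_vms(done, total, elapsed, deadline, rate_per_vm, current_vms):
    remaining_rate = (total - done) / max(deadline - elapsed, 1e-9)
    return max(0, math.ceil(remaining_rate / rate_per_vm) - current_vms)

if needs_handling(done=300, total=1000, elapsed=30, deadline=60):
    print("add", extra_vms(300, 1000, 30, 60, rate_per_vm=5, current_vms=4), "VMs")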

Futian Wang, Xiao Liu, Wei Zhang, Cheng Zhang
Backmatter
Metadata
Title
Data Science
Edited by
Prof. Jing He
Dr. Philip S. Yu
Yong Shi
Dr. Xingsen Li
Zhijun Xie
Dr. Guangyan Huang
Jie Cao
Dr. Fu Xiao
Copyright year
2020
Publisher
Springer Singapore
Electronic ISBN
978-981-15-2810-1
Print ISBN
978-981-15-2809-5
DOI
https://doi.org/10.1007/978-981-15-2810-1