2015 | Book

Intelligent Computation in Big Data Era

International Conference of Young Computer Scientists, Engineers and Educators, ICYCSEE 2015, Harbin, China, January 10-12, 2015. Proceedings

Editors: Hongzhi Wang, Haoliang Qi, Wanxiang Che, Zhaowen Qiu, Leilei Kong, Zhongyuan Han, Junyu Lin, Zeguang Lu

Publisher: Springer Berlin Heidelberg

Book Series: Communications in Computer and Information Science

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book constitutes the refereed proceedings of the International Conference of Young Computer Scientists, Engineers and Educators, ICYCSEE 2015, held in Harbin, China, in January 2015. The 61 revised full papers presented were carefully reviewed and selected from 200 submissions. The papers cover a wide range of topics related to intelligent computation in the Big Data era, such as artificial intelligence, machine learning, algorithms, natural language processing, image processing, MapReduce, and social networks.

Frontmatter

Big Data Theory

Theory of the Solution of Inventive Problems (TRIZ) and Computer Aided Design (CAD) for Micro-electro-mechanical Systems (MEMS)

Satellite remote sensing technology is widely used in all walks of life and plays an increasingly important role in responding to sudden, major natural disasters. As more high-resolution satellites are launched and put to use, the texture information in remote sensing imagery becomes much more abundant. In the age of big data, however, the short life of infrared remote sensors remains a nuisance. In the packaging process, the difference in thermal expansion coefficients [2] between the flip-chip bonded MEMS device and the substrate means that cooling after bonding can cause the MEMS to buckle. We combine TRIZ theory with flexure design in CAD to solve this problem and find that increasing the fold length reduces warpage. Solving the deformation problem of MEMS devices can facilitate the development of flip-chip technology and further the application of TRIZ theory in the study of remote sensing equipment.

Huiling Yu, Shanshan Cui, Dongyan Shi, Delin Fan, Jie Guo

A Distributed Strategy for Defensing Objective Function Attack in Large-scale Cognitive Networks

Most existing strategies for defending against OFA (Objective Function Attack) are centralized and suitable only for small-scale networks, and they usually neglect computation complexity and traffic load. In this paper, we focus on the OFA problem in large-scale cognitive networks, where the big data generated by the network must be considered and traditional methods may be of little help. We first analyze the interactive processes between attacker and defender in detail and then propose a defense strategy for OFA based on a differential game, abbreviated as DSDG. Secondly, we prove that the game saddle point and the optimal defense strategy exist simultaneously. Simulation results show that the proposed DSDG has less influence on network performance and a lower packet loss rate. More importantly, it can cope effectively with large-range OFA.

Guangsheng Feng, Junyu Lin, Huiqiang Wang, Xiaoyu Zhao, Hongwu Lv, Qiao Zhao

An Energy Efficient Random Walk Based Expanding Ring Search Method for Wireless Networks

Wireless networks generate large amounts of data. Because the power of wireless nodes is finite, it is important to design energy-efficient data search methods. Expanding Ring Search (ERS) is a data search technique that explores for targets progressively and is widely used to locate destinations or information in wireless networks. Existing studies on improving the energy efficiency of ERS cannot work without positioning systems. In this paper, we combine the technique of random walk with ERS and propose a random walk based expanding ring search method (RWERS) for large-scale wireless networks. RWERS works without positioning systems and improves the energy efficiency of ERS by using random walk to prevent each node from transmitting the same request more than once. We compare RWERS with the optimal ERS strategy and CERS in networks with various terrain shapes. The simulation results show that RWERS decreases the energy cost by 50% compared with ERS without decreasing the success rate, and has twice the success rate of CERS when the network is sparse. RWERS also adapts to various terrain shapes better than CERS.

Huiqiang Wang, Xiuxiu Wen, Junyu Lin, Guangsheng Feng, Hongwu Lv
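
The progressive exploration that plain ERS performs can be sketched as follows. This is a minimal illustration of baseline ERS only, not the authors' RWERS: the adjacency-list graph, the TTL sequence (1, 2, 4, 8) and the cost model (one transmission per forwarding node per round) are all assumptions made for the example.

```python
from collections import deque


def ring_search(graph, source, target, ttl):
    """Flood a request up to `ttl` hops from `source`.

    Returns (found, transmissions), where each forwarding node
    counts as one transmission (assumed cost model).
    """
    seen = {source}
    frontier = deque([(source, 0)])
    transmissions = 0
    while frontier:
        node, hops = frontier.popleft()
        if node == target:
            return True, transmissions
        if hops == ttl:
            continue  # TTL exhausted: this node does not forward
        transmissions += 1  # the node broadcasts the request once
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return False, transmissions


def expanding_ring_search(graph, source, target, ttl_sequence=(1, 2, 4, 8)):
    """Retry the flood with growing TTLs; return (successful ttl, total cost)."""
    total = 0
    for ttl in ttl_sequence:
        found, cost = ring_search(graph, source, target, ttl)
        total += cost
        if found:
            return ttl, total
    return None, total


# A five-node line topology: 0 - 1 - 2 - 3 - 4
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(expanding_ring_search(line, 0, 3))  # (4, 6)
```

Each failed round repeats the transmissions of the inner rings, which is the kind of redundant cost that energy-aware variants of ERS try to reduce.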

Reconstructing White Matter Fiber from Brain DTI for Neuroimage Analysis

Diffusion Tensor Imaging (DTI), a magnetic resonance technology, is widely applied, especially in analyzing brain function and disease. DTI produces a four-dimensional image based on the diffusion characteristics of water molecules. In DTI, each voxel has its own value and orientation, from which the FACT algorithm traces the movement of water molecules. This traces the white matter fibers of the brain, and tractography is the resulting tracking image of neural fiber bundles. Processing all of these images is a big data task. Tractography is important for analyzing brain function, building brain connectivity, and analyzing neuroimages of disease. The paper mainly presents a data processing procedure based on registration, as used in existing research, and additionally gives a comparison between nonlinear and linear registration.

Gentian Li, Youxiang Duan, Qifeng Sun

Resolution Method of Six-Element Linguistic Truth-Valued First-Order Logic System

Based on 6-element linguistic truth-valued lattice implication algebras, this paper discusses the 6-element linguistic truth-valued first-order logic system. Using some special properties of 6-element linguistic truth-valued first-order logic, we discuss its satisfiability problem and propose a resolution method for it. The resolution algorithm is then presented, and an example illustrates the effectiveness of the proposed method.

Li Zou, Di Liu, Yingxin Wang, Juan Qu

Graph Similarity Join with K-Hop Tree Indexing

Graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. To accelerate similarity joins based on graph edit distance, we use a preprocessing strategy to remove mismatching graph pairs with significant differences. We then propose a novel k-hop-tree based indexing method, which builds an index for each graph by grouping, for each key node, the nodes reachable within k hops while conserving structure. Experiments on real and synthetic graph databases confirm that our method achieves good join quality in graph similarity join. Moreover, the join process can be finished in polynomial time.

Yue Wang, Hongzhi Wang, Chen Ye, Hong Gao
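
One simple way to discard graph pairs with significant differences before any expensive comparison is a size-based lower bound on the edit distance. The sketch below is an assumption for illustration and is not taken from the paper: it relies on the standard fact that, with unit-cost edit operations, the graph edit distance is at least the difference in vertex counts plus the difference in edge counts. The (vertices, edges) tuple representation and the function names are hypothetical.

```python
def size_lower_bound(g1, g2):
    """Lower bound on unit-cost graph edit distance from sizes alone.

    Each graph is a (vertices, edges) pair of sets (assumed representation).
    """
    (v1, e1), (v2, e2) = g1, g2
    return abs(len(v1) - len(v2)) + abs(len(e1) - len(e2))


def candidate_pairs(graphs_a, graphs_b, tau):
    """Keep only the index pairs that can still be within edit distance tau."""
    return [
        (i, j)
        for i, ga in enumerate(graphs_a)
        for j, gb in enumerate(graphs_b)
        if size_lower_bound(ga, gb) <= tau
    ]


path3 = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
path4 = ({"a", "b", "c", "d"}, {("a", "b"), ("b", "c"), ("c", "d")})
single = ({"a"}, set())

# path3 vs path4 differ by one vertex and one edge; path3 vs single by four.
print(candidate_pairs([path3], [path4, single], tau=2))  # [(0, 0)]
```

Pairs that survive such a filter still need verification, for which an index like the k-hop tree described above would be used.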

Efficient String Similarity Search on Disks

String similarity search is a basic operation for various applications, such as data cleaning, spell checking, bioinformatics and information integration. Memory-based q-gram inverted indexes fail to support string similarity search over large-scale string datasets because of memory limitations: they can no longer work once the data size grows beyond the memory size. In the era of big data, large string datasets are quite common. The existing external-memory method, Behm-Index, supports only the length filter and the prefix filter. This paper proposes LPA-Index to reduce I/O cost for better query response time. LPA-Index is a disk-resident index, so the data size is not limited by the memory size. LPA-Index supports multiple filters to reduce query candidates effectively, and it adaptively reads inverted lists during query processing for better I/O performance. Experimental results demonstrate the efficiency of LPA-Index and its advantages over the existing state-of-the-art disk index, Behm-Index, with regard to I/O cost and query response time.

Jinbao Wang, Donghua Yang
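
For orientation, the memory-based q-gram inverted index that the abstract takes as its starting point can be sketched as below. This is not LPA-Index or Behm-Index: it is a small in-memory illustration with a length filter and a conservative q-gram filter, followed by edit-distance verification. The function names and the choice q = 2 are assumptions.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted index: q-gram -> list of string ids containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return index


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(strings, index, query, tau, q=2):
    """Return the strings within edit distance tau of the query."""
    # Count filter, used conservatively: if the bound is at least 1, every
    # answer shares at least one q-gram with the query; otherwise scan all.
    if len(query) - q + 1 - q * tau >= 1:
        candidates = {sid for g in set(qgrams(query, q)) for sid in index.get(g, ())}
    else:
        candidates = range(len(strings))
    results = []
    for sid in sorted(candidates):
        s = strings[sid]
        if abs(len(s) - len(query)) > tau:  # length filter
            continue
        if edit_distance(s, query) <= tau:  # verification
            results.append(s)
    return results


words = ["harbin", "harbor", "berlin", "hardin"]
print(search(words, build_index(words), "harbin", tau=1))  # ['harbin', 'hardin']
```

The whole index lives in a Python dictionary here, which is exactly the limitation that motivates a disk-resident design.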

A Joint Link Prediction Method for Social Network

The popularity of social network services has caused rapid growth in the number of users. Predicting the links between users has been recognized as one of the key tasks in social network analysis. Most present link prediction methods either analyze the topology of the social network graph or consider only the users' interests, which leads to low prediction accuracy. Furthermore, the large amount of user interest information makes common interest extraction more difficult. To solve these problems, this paper proposes a joint social network link prediction method, JLPM. Firstly, we give the problem formulation. Secondly, we define a joint prediction feature model (JPFM) that describes user interest topic features and network topology features together, and present the corresponding feature extraction algorithm. JPFM uses the LDA topic model to extract user interest topics and a random walk algorithm to extract the network topology features. Thirdly, by transforming the link prediction problem into a classification problem, we use a typical SVM classifier to predict the possible links. Finally, experimental results on a citation data set show the feasibility of our method.

Xiaoqin Xie, Yijia Li, Zhiqiang Zhang, Shuai Han, Haiwei Pan

HCS: Expanding H-Code RAID 6 without Recalculating Parity Blocks in Big Data Circumstance

This paper introduces HCS, a new RAID 6 expansion method designed for big data environments. HCS expands RAID 6 arrays that use H-Code. Two key techniques avoid recalculating parity blocks: anti-diagonal data block selection and horizontal data migration. These two techniques ensure that data blocks remain in the same verification zones, namely the horizontal verification zone and the anti-diagonal verification zone. Experimental results show that, compared with SDM, which is also a fast expansion method, HCS reduces expansion time by 3.6% and improves performance by 4.62% under four traces.

Shiying Xia, Yu Mao, Minsheng Tan, Weipeng Jing

Efficient Processing of Multi-way Joins Using MapReduce

Multi-way join is critical for many big data applications such as data mining and knowledge discovery. Even though much research has been devoted to processing multi-way joins using MapReduce, several general problems remain to be addressed, such as the transfer of large amounts of unpromising intermediate data and the lack of better coordination mechanisms. This work proposes an efficient multi-way join processing model using MapReduce, named Sharing-Coordination-MapReduce (SC-MapReduce), which provides sharing and coordination functions. Our SC-MapReduce model can filter out most of the unpromising intermediate data by using the sharing mechanism and can optimize the coordination of the multiple tasks of multi-way joins. Extensive experiments show that the proposed model is efficient, robust and scalable.

Linlin Ding, Siping Liu, Yu Liu, Aili Liu, Baoyan Song

A Selection Algorithm of Service Providers for Optimized Data Placement in Multi-Cloud Storage Environment

The benefits of cloud storage come with challenges and open issues concerning service availability, vendor lock-in, data security and more. One solution that mitigates these problems is multi-cloud storage, where the selection of service providers is a key point. In this paper, we propose an algorithm, based on IDA, that selects an optimal provider subset for data placement from a set of providers in a multi-cloud storage architecture. It is designed to achieve a good tradeoff among storage cost, algorithm cost, vendor lock-in, transmission performance and data availability. Experiments using parameters taken from real cloud providers demonstrate that it finds optimal solutions efficiently and accurately in a reasonable amount of time.

Wenbin Yao, Liang Lu

Data-Aware Partitioning Schema in MapReduce

Building on the advantages of the MapReduce programming model for parallel computing and for processing data and tasks on large-scale clusters, this paper proposes a data-aware partitioning schema in MapReduce for large-scale high-dimensional data. It optimizes the partitioning of data blocks that make the same contribution to computation in MapReduce. Using a two-stage data partitioning strategy, the data are uniformly distributed into data blocks by clustering and partitioning. The experiments show that the data-aware partitioning schema is very effective and extensible for improving the query efficiency of high-dimensional data.

Liang Junjie, Liu Qiongni, Yin Li, Yu Dunhui

An ICN-Oriented Name-Based Routing Scheme

Information-Centric Networking (ICN) treats contents as first-class citizens and adopts content names for routing. However, ICN faces the challenges of big data. The massive number of content names and location-independent naming bring scalability and efficiency challenges for content addressing. A scalable and efficient name-based routing scheme is a critical component of ICN. This paper proposes a novel Scalable Name-based Geometric Routing scheme, SNGR. To resolve location-independent names to locations, SNGR utilizes a bi-level sloppy grouping design. To provide scalable location-dependent routing for name resolution, SNGR proposes a universal geometric routing framework. Our theoretical analyses guarantee the performance of SNGR. Simulation experiments show that SNGR outperforms other similar routing schemes in terms of routing table scalability, reliability under failures, and path stretch and latency in name resolution.

Sun Yanbin, Zhang Yu, Zhang Hongli, Fang Binxing, Shi Jiantao

Partial Clones for Stragglers in MapReduce

Stragglers can delay jobs and seriously reduce cluster efficiency. Much research has been devoted to this problem, such as Blacklist [8], speculative execution [1, 6] and Dolly [8]. In this paper, we put forward a new approach for mitigating stragglers in MapReduce, named Hummer. It starts task clones only for tasks at high risk of delay. Experiments show that it decreases the risk of job delay while consuming fewer resources. For small jobs, Hummer also improves job completion time by 48% and 10% compared to LATE and Dolly, respectively.

Jia Li, Changjian Wang, Dongsheng Li, Zhen Huang

A Novel Subpixel Curved Edge Localization Method

With the rapid development of digital image processing technology, machine vision has been widely used in the automatic inspection of industrial products. A large number of products can be processed by computer instead of by humans in a shorter time. In automatic inspection, edge detection is one of the most commonly used methods. However, with the increasing demand for detection precision, traditional pixel-level methods struggle to meet the requirement, and more subpixel-level methods are coming into use.

This paper presents a new method to detect curved edges with high precision. First, the target area ratio of pixels near the edge is computed using a one-dimensional edge detection method. Second, a parabola is used to approximate the curved edge. We then select appropriate parameters to obtain accurate results. This method is able to detect curved edges at the subpixel level and shows its practical effectiveness in the automatic measurement of arc-shaped products in large industrial scenes.

Zhengyang Du, Wenqiang Zhang, Jinxian Qin, Hong Lu, Zhong Chen, Xidian Zheng

Maximal Influence Spread for Social Network Based on MapReduce

Due to its importance, the influence spread maximization problem for social networks has been addressed by a number of algorithms. However, when it comes to scalability, existing algorithms are not efficient enough to cope with real-world social networks, which are often big networks. To handle big social networks, we propose parallelized influence spread algorithms. Using MapReduce in Hadoop as the platform, we propose the Parallel DAGIS algorithm, a parallel influence spread maximization algorithm. Considering the information loss in the Parallel DAGIS algorithm, we also develop a Parallel Sampling algorithm and change DFS to BFS during the search process. By considering neighbor nodes two or more hops away, we further improve the accuracy of DHH. Experimental results show that the Parallel DAGIS and Parallel Sampling algorithms improve efficiency on big social networks, and that taking into account neighbors more than two hops away improves the accuracy of DHH.

Qiqi Shi, Hongzhi Wang, Dong Li, Xinfei Shi, Chen Ye, Hong Gao

Implementation of Dijkstra’s Token Circulation on Sensor Network

Sensor networks can consist of a large number of sensors. Sensor networks often use low-cost units and are thus subject to malfunctions that can bring the system into inconsistent states. After deployment, the system can be situated in places that are hard to reach, so manual reboot operations are undesirable and even infeasible. It is therefore imperative to consider eventual recovery from arbitrary faults when designing sensor networks. Dijkstra's algorithm is an important foundation of self-managing and fault-tolerant computing in distributed systems, since it allows a distributed system to recover from an arbitrary starting state within a finite time. The arbitrary starting state can model arbitrary failure (as long as the code segment stays correct). Another key advantage of Dijkstra's asynchronous algorithm is that no global clock is needed. This project tests an implementation of Dijkstra's algorithm using snapshotting techniques that we developed in earlier work. The sensors can start from any state but reach a consistent one after several cycles of running. We demonstrate the usefulness of our testing technique.

Zhiqiang Ma, Achuan Wang, Jifeng Guo
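
The recovery property described above can be illustrated with a small simulation of Dijkstra's K-state token ring. This is a sketch under assumptions, not the sensor implementation or the snapshot-based testing technique from the paper: it runs on one machine, uses a random central scheduler that lets one privileged node move at a time, and takes K equal to the number of nodes.

```python
import random


def privileged(states, i):
    """Node 0 holds the token when it equals its predecessor (the last node);
    every other node holds it when it differs from its predecessor."""
    if i == 0:
        return states[0] == states[-1]
    return states[i] != states[i - 1]


def step(states, i, k):
    """Let privileged node i make its move."""
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, max_steps, rng):
    """Run until exactly one node is privileged; return the steps taken."""
    for steps in range(max_steps):
        holders = [i for i in range(len(states)) if privileged(states, i)]
        if len(holders) == 1:
            return steps
        step(states, rng.choice(holders), k)  # scheduler picks any holder
    raise RuntimeError("did not stabilize within max_steps")


# Arbitrary (possibly corrupted) starting state for a ring of five nodes.
ring = [3, 1, 4, 1, 2]
taken = stabilize(ring, k=5, max_steps=1000, rng=random.Random(0))
print("stabilized after", taken, "moves:", ring)
```

Once a single privilege remains, every later move passes it to the next node, so the ring keeps circulating exactly one token without any global clock.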

Co-simulation and Solving of Clapper Type Relay Dynamic Characteristics Based on Multi-software

Using a co-simulation method based on multiple commercial finite element analysis (FEA) software packages, an electromagnetic system model was built to solve the static and dynamic characteristics of a clapper-type relay. The differential equations were solved and the electromagnetic torque interpolation was calculated in the Fortran programming language, yielding the static and dynamic characteristics of the relay's MEM coupling system. The validity and accuracy of this method have been confirmed by experimental results. The conclusions obtained are valuable for optimizing the production of clapper-type relays.

Guo Jifeng, Ma Yan, Ma Zhiqiang, Li He, Yu Haoliang

iHDFS: A Distributed File System Supporting Incremental Computing

Big data are often processed repeatedly with small changes, which is a major form of big data processing. This incremental nature of big data means that an incremental computing mode can greatly improve performance. HDFS is the distributed file system of Hadoop, the most popular platform for big data analytics. HDFS adopts a fixed-size chunking policy, which is inefficient for incremental computing. Therefore, in this paper, we propose iHDFS (incremental HDFS), a distributed file system that provides a basic guarantee for parallel processing of big data. iHDFS is implemented as an extension to HDFS. In iHDFS, the Rabin fingerprint algorithm is applied to achieve content-defined chunking. This policy makes data chunking much more stable, so intermediate processing results can be reused efficiently and the performance of incremental data processing improves significantly. Experimental results demonstrate the effectiveness and efficiency of iHDFS.

Zhenhua Wang, Qingsong Ding, Fuxiang Gao, Derong Shen, Ge Yu
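
Content-defined chunking, as opposed to fixed-size chunking, can be sketched as follows. This is an illustration under assumptions and not the iHDFS implementation: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the window size, boundary mask and minimum and maximum chunk sizes are arbitrary example values.

```python
def chunk(data, window=16, mask=0x3F, min_size=32, max_size=1024):
    """Split `data` (bytes) at positions chosen by its content.

    A boundary is declared where the rolling hash of the last `window`
    bytes has its low bits equal to zero, subject to size limits.
    """
    base, mod = 257, (1 << 61) - 1
    top = pow(base, window, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * base + byte) % mod
        if i >= window:
            h = (h - data[i - window] * top) % mod  # drop the oldest byte
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


payload = bytes((i * 31 + (i >> 3) * 17) % 251 for i in range(5000))
pieces = chunk(payload)
print(len(pieces), "chunks, sizes from", min(map(len, pieces)), "to", max(map(len, pieces)))
```

Because the hash depends only on the most recent bytes, an insertion near the start of a file typically shifts only the nearby boundaries, whereas fixed-size chunking would change every later chunk.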

Unstructured Big Data Processing

Chinese Explanatory Segment Recognition as Sequence Labeling

How to mine the underlying reasons for opinions is a key issue in opinion mining. In this paper, we propose a CRF-based labeling approach to explanatory segment recognition in Chinese product reviews. To this end, we first reformulate explanatory segment recognition as a labeling task on a sequence of words, and then explore various features at three linguistic levels, namely character, word and semantic, under the framework of conditional random fields. Experimental results on product reviews from the mobile phone and car domains show that the proposed approach significantly outperforms existing state-of-the-art methods for explanatory segment extraction.

Yu He, Da Pan, Guohong Fu

An Unsupervised Method for Short-Text Sentiment Analysis Based on Analysis of Massive Data

Common forms of short text include microblogs, Twitter posts, short product reviews, short movie reviews and instant messages. Sentiment analysis of such text has been a hot topic. This paper proposes a highly accurate model for short-text sentiment analysis. The research targets microblogs, product reviews and movie reviews. Based on massive user data, words, symbols and sentences with emotional tendencies prove to be important indicators in short-text sentiment analysis, and using these features is an effective way to predict the emotional tendency of short text. The model accounts for the polysemy of single-character emotional words in Chinese and treats single-character and multi-character emotional words separately. The idea behind the model can be applied to various kinds of short-text data. Experiments show that the model performs well in most cases.

Zhenhua Huang, Zhenrong Zhao, Qiong Liu, Zhenyu Wang

Normalization of Homophonic Words in Chinese Microblogs

Homophonic words are very popular in Chinese microblogs, posing a new challenge for Chinese microblog text analysis. However, to date, very little research has been conducted on Chinese homophonic word normalization. In this paper, we treat Chinese homophonic word normalization as a process of language decoding and propose an n-gram based approach. To this end, we first employ homophonic-original word or character mapping tables to generate normalization candidates for a given sentence with homophonic words, and then exploit n-gram language models to decode the best normalization from the candidate set. Our experimental results show that using the homophonic-original character mapping table and n-grams trained on the microblog corpus helps improve performance in homophonic word recognition and restoration.

Xin Zhang, Jiaying Song, Yu He, Guohong Fu

Personalized Web Image Retrieval Based on User Interest Model

Traditional search engines do not consider that users' interests differ, and they do not provide a personalized retrieval service, so retrieval efficiency is not high. To solve this problem, a method for personalized web image retrieval based on a user interest model is proposed. Firstly, a formal definition of the user interest model is provided. Then the user interest model combines explicit and implicit tracking to refine the user's interest information and provide personalized web image retrieval. Experimental results show that the user interest model can be successfully applied in web image retrieval.

Zhaowen Qiu, Haiyan Chen, Haiyi Zhang

Detection of Underwater Objects by Adaptive Threshold FCM Based on Frequency Domain and Time Domain

To accurately detect underwater objects in sonar images, whose data exhibit big data characteristics, a novel adaptive threshold FCM (Fuzzy Clustering Algorithm) based on the frequency domain and time domain is proposed. Based on the relationship between sonar image data and big data, a wavelet de-noising method is first used to smooth noise. After de-noising, the sonar image is divided into blocks and each sub-block region is processed by a two-dimensional discrete Fourier transform, with the maximum of its amplitude spectrum used as the frequency-domain feature. The time-domain mean and standard deviation, together with the frequency-domain maximum amplitude spectrum, are then taken as features for block k-means clustering, which determines the initial clustering centers. FCM is then applied to the sonar image for detection. Based on the clustered image, an adaptive threshold is constructed from the distribution of the sea-bottom reverberation region of the sonar image, and the final detection results are obtained. Comparative experiments demonstrate that the proposed algorithm achieves good detection precision and adaptability.

Xingmei Wang, Guangyu Liu, Lin Li, Shouxuan Jiang

Discovering Event Regions Using a Large-Scale Trajectory Dataset

The city is facing unprecedented pressure from rapid development and a moving population. Hidden knowledge that serves society can be found in human trajectory data. In this paper, we define a state-of-the-art concept of fluctuant locations with the PCA method and discover the shared attribute of fluctuant locations, called an event, with a topic model. Within a time slice, locations with the same attribute are called an event region. Event regions aim to help understand the relationship between spatial-temporal locations in the city and to support early-warning analysis for city planning, construction, intelligent navigation, route planning and location-based services. We use the GeoLife public dataset for experiments and verification.

Ling Yang, Zhijun Li, Shouxu Jiang

An Evolutional Learning Algorithm Based on Weighted Likelihood for Image Segmentation

Due to the coupling of model parameters, most spatial mixture models for image segmentation cannot be computed directly by the EM algorithm. This paper proposes an evolutional learning algorithm based on the weighted likelihood of mixture models for image segmentation. The proposed algorithm consists of multiple generations of a learning algorithm, and each stage corresponds to an EM algorithm for a spatially constrained independent mixture model. The EM result of each stage, smoothed in the spatial domain, serves as supervision information to guide the clustering of the next stage. The spatial constraint information is thus incorporated into the independent mixture model, so the coupling problem of the spatial model parameters can be avoided at a lower computational cost. Experiments using synthetic and real images are presented to show the efficiency of the proposed algorithm.

Yu Lin-Sen, Liu Yong-Mei, Sun Guang-Lu, Li Peng

Visualization Analysis for 3D Big Data Modeling

This paper describes an automatic system for 3D big data face modeling using front-view and side-view images, taken in orthogonal directions by an ordinary digital camera. The paper covers four key points in 3D visualization. Firstly, we study the 3D big data of face modeling, including facial feature extraction from 2D images. Secondly, we present the techniques from computer vision and image processing and our new method for extracting information from images and creating a 3D model. Thirdly, software for 3D face modeling based on 2D images is implemented in C# with the EMGU CV library and the XNA framework. Finally, we design experiments, run tests and record the results to measure the performance of our method.

TianChi Zhang, Jing Zhang, JianPei Zhang, HaiWei Pan, Kathawach Satianpakiranakorn

MBITP: A Map Based Indoor Target Prediction in Smartphone

This paper presents MBITP, a novel method for indoor target prediction from sensor data, which may be big data. To predict the target, a probability model is presented. In addition, a real-time error correction technique based on map features is designed to enhance the estimation accuracy. Based on these, we propose an effective prediction algorithm. A practical evaluation shows that the method introduced in this paper has acceptable performance in real-time target prediction.

Bowen Xu, Jinbao Li

A Method of Automatically Generating 2D Animation Intermediate Frames

This paper proposes a method for automatically generating and coloring intermediate frames in the two-dimensional animation process. Users simply provide the starting and ending key frames. Using the algorithm proposed in this paper, the system automatically generates all the frames in between and colors them automatically based on the colors of the starting and ending key frames. The experimental results show that the proposed method can be successfully used in animation production, greatly improving its efficiency.

Zhaowen Qiu, Haiyan Chen, Tingting Zhang, Yan Gao

Machine Learning for Big Data

Metric Learning with Relative Distance Constraints: A Modified SVM Approach

Distance metric learning plays an important role in many machine learning tasks. In this paper, we propose a method for learning a Mahalanobis distance metric. By formulating the metric learning problem with relative distance constraints, we suggest a Relative Distance Constrained Metric Learning (RDCML) model which can be easily implemented and effectively solved by a modified support vector machine (SVM) approach. Experimental results on UCI datasets and handwritten digit datasets show that RDCML achieves better or comparable classification accuracy when compared with state-of-the-art metric learning methods.

Changchun Luo, Mu Li, Hongzhi Zhang, Faqiang Wang, David Zhang, Wangmeng Zuo

An Efficient Webpage Classification Algorithm Based on LSH

With the explosive growth of Internet information, it is more and more important to fetch real-time and relevant information. This places higher requirements on the speed of webpage classification, which is one of the common methods for retrieving and managing information. To obtain a more efficient classifier, this paper proposes a webpage classification method based on locality-sensitive hash functions. It contains three innovative modules: building a feature dictionary, mapping feature vectors to fingerprints using locality-sensitive hashing, and extending webpage features. The comparison results show that the proposed algorithm achieves better performance in less time than the naïve Bayes one.

Junjun Liu, Haichun Sun, Zhijun Ding
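
Mapping a weighted feature vector to a compact fingerprint with locality-sensitive hashing can be sketched with SimHash. Choosing SimHash is an assumption made for illustration, since the abstract does not name the hash family used; the 64-bit width, the use of MD5 as the per-token hash and the example feature weights are likewise hypothetical.

```python
import hashlib


def simhash(features, bits=64):
    """Fingerprint of a {token: weight} feature dictionary.

    Each token votes on every bit with its weight; the sign of the
    total vote decides the corresponding fingerprint bit.
    """
    votes = [0] * bits
    for token, weight in features.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for b in range(bits):
            votes[b] += weight if (h >> b) & 1 else -weight
    return sum(1 << b for b in range(bits) if votes[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


page_a = {"big": 2, "data": 3, "mapreduce": 1, "hadoop": 1}
page_b = {"big": 2, "data": 3, "mapreduce": 1, "spark": 1}
page_c = {"football": 3, "league": 2, "goal": 2}

print(hamming(simhash(page_a), simhash(page_b)))
print(hamming(simhash(page_a), simhash(page_c)))
```

Pages with mostly shared features tend to produce fingerprints at a small Hamming distance, so a classifier can compare short fingerprints instead of full feature vectors.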

Semi-supervised Affinity Propagation Clustering Based on Subtractive Clustering for Large-Scale Data Sets

With a growing number of large-scale data sets, building the similarity matrix required by the affinity propagation clustering algorithm incurs huge storage and computation costs. Therefore, this paper proposes an improved affinity propagation clustering algorithm. First, subtractive clustering is added, using the density values of the data points to obtain the initial cluster points. Then, the similarity distances between the initial cluster points are calculated and, borrowing the idea of semi-supervised clustering, pairwise constraint information is added to construct a sparse similarity matrix. Finally, AP clustering is run on the cluster representative points until a suitable cluster division is reached. Experimental results show that the algorithm greatly reduces the amount of calculation and the storage required for the similarity matrix, and outperforms the original algorithm in clustering quality and processing speed.

Qi Zhu, Huifu Zhang, Quanqin Yang

Gene Coding Sequence Identification Using Kernel Fuzzy C-Mean Clustering and Takagi-Sugeno Fuzzy Model

Sequence analysis technology under big data provides unprecedented opportunities for modern life science. A novel gene coding sequence identification method is proposed in this paper. First, an improved short-time Fourier transform algorithm based on the Morlet wavelet is applied to extract the power spectrum of the DNA sequence. Then, a threshold determination method based on kernel fuzzy C-mean clustering is used to combine the signal-to-noise ratio (SNR) data of exons and introns into one sequence, classify the sequence into two types, and calculate the weighted sum of the two resulting SNR cluster centers and the discrimination threshold value. Finally, an exon interval endpoint identification algorithm based on the Takagi-Sugeno fuzzy identification model is presented, which trains the Takagi-Sugeno model, optimizes the model parameters with the Levenberg-Marquardt least squares method, completes the model, and determines the fuzzy rules. To verify the effectiveness of the proposed method, example tests are conducted on typical gene sequence sample data.

Tianlei Zang, Kai Liao, Zhongmin Sun, Zhengyou He, Qingquan Qian

Image Matching Using Mutual k-Nearest Neighbor Graph

Although weighted voting matching is one of the most successful image matching methods, each candidate correspondence receives voting scores from all other candidates, so the voting scores cannot clearly distinguish correct matches from incorrect ones. In this paper, a new image matching method based on the mutual k-nearest neighbor (k-nn) graph is proposed. First, the mutual k-nn graph is constructed according to the similarity between candidate correspondences. Then, each candidate receives voting scores only from its mutual k nearest neighbors. Finally, based on the voting scores, the matching correspondences are computed by a greedy ranking technique. Experimental results demonstrate the effectiveness of the proposed method.

Ting-ting Li, Bo Jiang, Zheng-zheng Tu, Bin Luo, Jin Tang
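
The first two steps of the abstract above (building the mutual k-nn graph and restricting votes to mutual neighbors) can be sketched as follows. This is an illustrative sketch only: the similarity matrix is made up, and summing similarities as the voting score is an assumption, since the abstract does not define the score.

```python
def mutual_knn(sim, k):
    """Return, for each node i, the set of nodes j such that i and j are
    both among each other's k most similar nodes."""
    n = len(sim)
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: -sim[i][j])
        knn.append(set(others[:k]))
    return [{j for j in knn[i] if i in knn[j]} for i in range(n)]


def voting_scores(sim, k):
    """Score each candidate by the similarity it receives from mutual neighbors only."""
    neighbors = mutual_knn(sim, k)
    return [sum(sim[i][j] for j in neighbors[i]) for i in range(len(sim))]


# Hypothetical similarities between four candidate correspondences.
sim = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
print(mutual_knn(sim, 2))
print(voting_scores(sim, 2))
```

In this toy input the fourth candidate has no mutual neighbors and therefore receives no votes, which is the kind of separation between consistent and inconsistent candidates the abstract describes.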

A New Speculative Execution Algorithm Based on C4.5 Decision Tree for Hadoop

As a distributed computing platform, Hadoop provides an effective way to handle big data. In Hadoop, the completion time of a job can be delayed by a straggler. Although the definitive cause of a straggler is hard to detect, speculative execution is usually used to deal with this problem by simply backing up stragglers on alternative nodes. In this paper, we design SECDT, a new speculative execution algorithm for Hadoop based on the C4.5 decision tree. In SECDT, we predict the completion time of stragglers and of backup tasks using a C4.5 decision tree. We then compare the predicted completion times of the stragglers and of their backup tasks, calculate the difference, and select the straggler with the maximum difference to start a backup task. Experimental results show that SECDT predicts execution time more accurately than other speculative execution methods and hence reduces job completion time.

Yuanzhen Li, Qun Yang, Shangqi Lai, Bohan Li
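
The selection rule in the SECDT abstract above (back up the straggler whose predicted completion time exceeds its backup's by the most) can be sketched as below. The predicted times are taken as given inputs; in the paper they come from a C4.5 decision tree, which is not reproduced here. The task identifiers, the tuple layout, and the choice to skip non-positive gains are assumptions.

```python
def pick_backup_candidate(tasks):
    """Pick the straggler with the largest predicted time saving.

    tasks: iterable of (task_id, predicted_remaining, predicted_backup) tuples,
    where both times are in the same unit (for example, seconds).
    Returns (task_id, gain), or (None, 0.0) if no backup is predicted to help.
    """
    best_id, best_gain = None, 0.0
    for task_id, remaining, backup in tasks:
        gain = remaining - backup  # time saved if the backup finishes first
        if gain > best_gain:
            best_id, best_gain = task_id, gain
    return best_id, best_gain


# Hypothetical predictions for three running tasks.
candidates = [("t1", 30.0, 25.0), ("t2", 80.0, 20.0), ("t3", 10.0, 15.0)]
print(pick_backup_candidate(candidates))  # ('t2', 60.0)
```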

An Improved Frag-Shells Algorithm for Data Cube Construction Based on Irrelevance of Data Dispersion

On-Line Analytical Processing (OLAP) is based on pre-computation of data cubes, which greatly reduces response time and improves OLAP performance. The Frag-Shells algorithm is a common pre-computation method. However, it relies so heavily on data dispersion that it performs poorly when confronted with large amounts of highly dispersed data. As the amount of data grows rapidly, the efficiency of data cube construction is increasingly becoming a significant bottleneck. In addition, with the popularity of cloud computing and big data, the MapReduce framework proposed by Google is playing an increasingly prominent role in parallel processing, and it is natural to use MapReduce to enhance the efficiency of parallel data cube construction. In this paper, by improving the Frag-Shells algorithm based on the irrelevance of data dispersion and taking advantage of the high parallelism of the MapReduce framework, we propose an improved Frag-Shells algorithm based on MapReduce. The simulation results show that the proposed algorithm greatly enhances the efficiency of cube construction.

Dong Li, Zhipeng Gao, Xuesong Qiu, Ran He, Yuwen Hao, Jingchen Zheng

An Availability Evaluation Method of Web Services Using Improved Grey Correlation Analysis with Entropy Difference and Weight

Web services are among the basic network services, and evaluating their availability is of great significance for improving user experience. This paper focuses on the availability evaluation of Web services and proposes a method using improved grey correlation analysis with entropy difference and weight (EWGCA). The method is based on grey correlation analysis; it uses entropy difference to capture changes in availability and sets weights to quantify the availability requirements of different operations or transactions in a service. Simulation experiments in high-load scenarios show that our method can accurately provide a hierarchical description and overall evaluation of Web service availability with small test sample volumes or uncertain data, even in the field of big data.

Zhanbo He, Huiqiang Wang, Junyu Lin, Guangsheng Feng, Hongwu Lv, Yibing Hu
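
For readers unfamiliar with grey correlation (grey relational) analysis, which the EWGCA abstract above builds on, here is a minimal sketch of the classic weighted grey relational grade. It does not include the paper's entropy-difference improvement. The distinguishing coefficient of 0.5 is the conventional default, and the indicator values and weights are hypothetical.

```python
def grey_relational_grades(reference, candidates, weights=None, rho=0.5):
    """Classic weighted grey relational grade of each candidate series
    against a reference series. Weights are assumed to sum to 1."""
    m = len(reference)
    weights = weights or [1.0 / m] * m
    # Absolute differences between the reference and each candidate.
    deltas = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every candidate coincides with the reference.
        return [1.0] * len(candidates)
    return [
        sum(w * (d_min + rho * d_max) / (d + rho * d_max) for w, d in zip(weights, row))
        for row in deltas
    ]


# Hypothetical normalized availability indicators (1.0 is the ideal value).
ideal = [1.0, 1.0, 1.0]
services = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
print(grey_relational_grades(ideal, services))
```

A grade closer to 1 means the candidate series tracks the reference more closely; the inputs are usually normalized to a common scale first.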

Equal Radial Force Structure Pressure Sensor Data Analysis and Finite Element Analysis

As big data is very important today, we create a force sensor with an AT-cut quartz crystal resonator and analyze the experimental data. The resonance frequency of a quartz crystal resonator changes under external force, with high precision and fast response. It also offers superior temperature and frequency stability. However, the resonator is mechanically fragile and vulnerable to stress concentration under bending, so it has rarely been applied to force measurement. The objective of this study is to construct a flat sensor mechanism that safely holds the quartz crystal resonator against external force. Using finite element multiphysics simulation software, we designed and implemented an innovative equal radial force structure. According to the measured data, the magnitude of the load applied to the equal radial force structure has a good linear relationship with the frequency of the quartz monitor chip. The proposed force sensor is flat, small, and sensitive. It can be applied in areas such as medical treatment and human contact force detection.

Lian-Dong Lin, Cheng-Jun Qiu, Xiang Yu, Peng Zhou

Knowledge Acquisition from Forestry Machinery Patent Based on the Algorithm for Closed Weighted Pattern Mining

The application of big data mining can create over a trillion dollars of value. Patents contain a great deal of new technologies and methods that have unique value for product innovation. To improve the effectiveness of big data mining and aid the innovation of forestry machinery products, the algorithm for closed weighted pattern mining is applied to acquire function knowledge from forestry machinery patents. Compared with other pattern mining algorithms, this algorithm is better suited to the characteristics of patent data. It not only takes into account the importance of different items to reduce the search space effectively, but also avoids generating excessive uninteresting patterns while preserving quality. The extensive performance study shows that the patterns mined by the closed weighted pattern algorithm are more representative and that the acquired knowledge has more practical significance.

Huiling Yu, Jie Guo, Dongyan Shi, Guangsheng Chen, Shanshan Cui

Information Propagation with Retweet Probability on Online Social Network

The rapid development of online social networks has attracted a lot of research attention. On online social networks, people can discuss their ideas and express their interests and opinions, all of which takes place through information propagation. How to model the information propagation cascade accurately has therefore become a hot topic. In this paper, we first incorporate the retweet probability into the traditional propagation models. To find an accurate retweet probability, we introduce a logistic regression model for every user based on the extracted features. Simulation is conducted on a real online social network using a crawled dataset, and some novel results are obtained. The homogeneous retweet probability in the original model underestimates the speed of information propagation, although the scale of information propagation is almost at the same level. In addition, the initial information poster is very important for a given propagation, which enables us to devise effective strategies to prevent rumor epidemics on social networks.

Xing Tang, Yining Quan, Qiguang Miao, Ruihong Hou, Kai Deng

A Data Stream Subspace Clustering Algorithm

The main aim of data stream subspace clustering is to find clusters in subspaces accurately and in reasonable time. Existing data stream subspace clustering algorithms are greatly influenced by their parameters. To address the flaws of traditional data stream subspace clustering algorithms, we propose SCRP, a new data stream subspace clustering algorithm. SCRP clusters quickly and is insensitive to outliers. When the data stream changes, the changes are recorded in a data structure named Region-tree, and the corresponding statistical information is updated, so SCRP can adjust clustering results promptly. Experiments on real and synthetic datasets show that SCRP is superior to existing data stream subspace clustering algorithms in both clustering precision and clustering speed, and that it scales well with the number of clusters and dimensions.

Xiang Yu, Xiandong Xu, Liandong Lin

Big Data Security

Fine-Grained Access Control for Big Data Based on CP-ABE in Cloud Computing

In Cloud Computing, the application software and the databases are moved to large centralized data centers, where the management of the data and services may not be fully trustworthy. This unique paradigm brings many new security challenges, which have not been well solved. Data access control is an effective way to ensure big data security in the cloud. In this paper, we study the problem of fine-grained data access control in cloud computing. Based on the CP-ABE scheme, we propose a novel access control policy that achieves fine-grained control and implements user revocation effectively. The analysis results indicate that our scheme ensures data security in cloud computing and significantly reduces the cost to the data owner.

Qi Yuan, Chunguang Ma, Junyu Lin

The New Attribute-Based Generalized Signcryption Scheme

An attribute-based generalized signcryption scheme based on bilinear pairing is proposed. By changing attributes, the scheme can switch adaptively among encryption-only mode, signature-only mode, and signcryption mode. It is shown that the scheme achieves semantic security under the decisional bilinear Diffie-Hellman assumption and unforgeability under the computational Diffie-Hellman assumption. It is more efficient than the traditional approach and can be used to secure big data in networks.

Yiliang Han, Yincheng Bai, Dingyi Fang, Xiaoyuan Yang

An Improved Fine-Grained Encryption Method for Unstructured Big Data

Among big data protection technologies, most existing approaches encrypt the data in its entirety, which has driven research on lightweight encryption algorithms without considering the protected data itself. Our previous paper (FGEM) found that not all parts of the data need protection: whole-data protection can be replaced as long as the critical parts of structured data are protected. Reducing unnecessary encryption makes great sense for raising efficiency in big data processing. In this paper, FGEM is improved to make it suitable for protecting semi-structured and unstructured data efficiently. The semi-structured and unstructured data are stored in an improved tree structure, and the improved FGEM is achieved by obtaining congener nodes. The experiments show that the improved FGEM has short operating time and low memory consumption.

Changli Zhou, Chunguang Ma, Songtao Yang

Education Track

Research and Reflection on Teaching of C Programming Language Design

The C language is a basic programming language and a compulsory foundation course for science and engineering majors. Constrained by limited teaching hours, students do not have time to do enough concrete integrated exercises. In the teaching of C programming language design, both teachers and students face problems. On the one hand, teachers should change their teaching ideology; on the other hand, students should study on their own initiative and improve their motivation. In this way, students will improve their programming ability and apply what they learn comprehensively.

Hui Gao, Zhaowen Qiu, Di Wu, Liyan Gao

A Peer Grading Tool for MOOCs on Programming

In massive open online courses (MOOCs), peer grading will play an important role in promoting MOOC development. In this paper, we develop a peer grading tool for programming courses on MOOCs. It is capable of dealing with a large and diverse student population and providing targeted subjective assessment. The tool first partitions the submissions into small chunks to reduce the reviewers' workload and give us flexibility to scale the code review process. Next, we use code normalization and chunk clustering to assign similar chunks to the same student, increasing reviewer efficiency. In addition, the tool uses a random allocation strategy and workload classification to balance reviewers' workloads while ensuring that every student gets diverse feedback. Finally, our evaluation experiments with a number of students in school indicate that the tool achieves a significant improvement over peer grading on MOOCs.

Zhaodong Wei, Wenjun Wu

DIPP—An LLC Replacement Policy for On-chip Dynamic Heterogeneous Multi-core Architecture

As the big data era arrives, it brings new challenges for massive data processing. Combining the GPU and CPU on one chip is the trend for relieving the pressure of large-scale computing. We found that GPU and CPU have different memory access characteristics. The most important one is that GPU programs include a large number of threads, which leads to a higher cache access frequency than CPU programs. Although the LRU policy favors programs with high memory access frequency, GPU programs do not get a corresponding performance boost even when more cache resources are provided. The LRU policy is therefore not suitable for heterogeneous multi-core processors.

Based on the different characteristics of GPU and CPU programs on memory access, this paper proposes an LLC dynamic replacement policy–DIPP (Dynamic Insertion / Promotion Policy) for heterogeneous multi-core processors. The core idea of the replacement policy is to reduce the miss rate of the program and enhance the overall system performance by limiting the cache resources that GPU can acquire and reducing the thread interferences between programs.

Experiments compare the DIPP replacement policy with LRU, and we discuss the results by GPU program category. Friendly programs improve by 23.29% in average performance (using the arithmetic mean). Programs with large working sets improve by 13.95%, compute-intensive programs by 9.66%, and stream-class programs by 3.8%.

Zhang Yang, Xing Zuocheng, Ma Xiao

Data Flow Analysis and Formal Method

Exceptions are abnormal data flows that need additional calculation to handle. Exception analysis, which concerns abnormal flow, covers many research topics, such as exception analysis methods and program verification. This article introduces another research direction of exception analysis, one based on formal methods. The article analyzes and summarizes the research literature on exception analysis and exception handling logic verification based on formal reasoning and model checking. We provide an overview of the relationships and differences between traditional ideas and formal methods in program exception analysis. At the end of the article, we offer some ideas about exception analysis based on a formal semantic study of procedure calls, in which exception handling is seen as a special semantic effect of procedure calls.

Yanmei Li, Shaobin Huang, Junyu Lin, Ya Li

10-Elements Linguistic Truth-Valued Intuitionistic Fuzzy First-Order Logic System

This paper presents the 10-elements linguistic truth-valued intuitionistic fuzzy algebra and its properties, based on the linguistic truth-valued implication algebra, which is suited to expressing both comparable and incomparable information. This method can also deal with uncertain problems that have both positive and negative evidence at the same time. A 10-elements linguistic truth-valued intuitionistic fuzzy first-order logic system is established in the intuitionistic fuzzy algebra.

Yingxin Wang, Xin Wen, Li Zou

Industry Track

The Research of Heartbeat Detection Technique for Blade Server

Blade servers have been widely used in telecommunications, finance, and other big data processing fields owing to their high efficiency, stability, and autonomy. This study takes the hot redundancy design for dual-BMC management of blade servers as its research subject and puts forward a heartbeat detection program that uses I2C to transmit IPMI commands. The program has been successfully applied to blade servers to achieve hot redundancy, monitoring, and management of the master/slave BMC management modules, and it is more standardized, reliable, and easy to implement.

Weiwei Jiang, Naikuo Chen, Shihua Geng

The Monitoring with Video and Alarm System Based on Linux Router

As one of the important safeguards in the security field, video monitoring systems have been applied extensively, for example in building security, information acquisition, and medical treatment. Currently, intelligent home systems and the Internet of Things are forming an increasingly mature industrial chain. With abundant new technology applications, this industrial chain is becoming more intelligent. One of its most important applications is the video monitoring system built on new technology. Deploying an appropriately designed video monitoring system in fields such as public security, traffic scheduling, and systems control is a major technological means of saving manpower while keeping the system running well. This article builds on the advantages of the Linux system, especially its open-source nature. By combining the Linux system with particular routers, we design a real-time and efficient video monitoring system. It is functional and can also automatically detect abnormal information or situations and then raise an alarm. The system uses the mature B/S mode in its software framework, which is also lightweight. In addition, the system keeps hardware costs low.

Qiqing Huang, Yan Chen, Honge Ren, Zhaowen Qiu

The Design and Implementation of an Encrypted U-Disk Based on NFC

With the popularity of USB 3.0, consumers favor mobile storage that is convenient and efficient. At the same time, people often pay little attention to the safety of traditional mobile storage, especially the U-disk, whose security is almost ineffective. Given the severe state of current information security, designing a safe and reliable way to protect users' privacy through encryption has gradually attracted the attention of both equipment manufacturers and users. As a quick and secure identification technology, NFC is widely applied in mobile payment and authentication, and it has great momentum toward wider adoption. Its unique convenience and safety guarantees give it good prospects for replacing traditional security solutions. This paper discusses the realization of a new encrypted U-disk that applies NFC technology, and it can serve as a guide for understanding the principles of NFC technology and its practical application to encryption.

Yunhang Zhang, Chang Hou, Yan Chen, Qiqing Huang, Honge Ren, Zhaowen Qiu

An On-chip Interconnection QoS Verification Platform of Processor of Large Data for Architectural Modeling Analysis

This paper introduces QoS verification of on-chip interconnection based on recent industry progress, in combination with an AMD processor chip design for big data. Some verification experience in architectural modeling and simulation of on-chip interconnection is also introduced.

Li Qinghua, Qin Jilong, Ding Xu, Wang Endong, Gong Weifeng

Demo Track

Paradise Pointer : A Sightseeing Scenes Images Search Engine Based on Big Data Processing

Nowadays, with the rapid development of networks, more and more people are willing to share their attractive photos on the internet, especially photos of sightseeing spots. However, there are countless obscure but splendid scenic spots that remain unknown to most people, and it is a pity that one finds a wonderful place but cannot reach it. Therefore, it is meaningful and useful to build an image search website specifically for sightseeing spot images. Meanwhile, we have stepped into the new era of Big Data, and the data we need to process, including images, is growing explosively. We therefore develop ParadisePointer, a scenery image search engine that uses Big Data processing methods. In this paper, we introduce the main stages of our system and some key features of ParadisePointer.

Jie Pan, Hongzhi Wang, Hong Gao, Wenxuan Zhao, Hongxing Huo, Huirong Dong

Chinese MOOC Search Engine

MOOCs (Massive Open Online Courses) have become more and more popular all over the world in recent years. However, search engines such as Google, Baidu, Yahoo and Bing do not support specialized MOOC course search. The purpose of this demo is to present a vertical search engine designed to retrieve MOOC courses for learners. The demo search engine obtains MOOC web pages with a focused Crawler. The pages are parsed into structured or unstructured data with a modeling-based Parser. The Indexer then builds an index for the data using Lucene. Finally, the extracted MOOC list is produced by Course_ranking and Retrieval. The demo search engine is accessible at http://www.MOOCsoso.com.

Bo An, Tianwei Qu, Haoliang Qi, Tianwei Qu

HawkEyes Plagiarism Detection System

High-obfuscation plagiarism detection in a big data environment, such as detecting paraphrasing and cross-language plagiarism, is often difficult for anti-plagiarism systems because plagiarism techniques are becoming more and more complex. This paper proposes HawkEyes, a plagiarism detection system based on the source retrieval and text alignment algorithms that were developed for the international competition on plagiarism detection organized by CLEF. The text alignment algorithm in HawkEyes won first place in PAN@CLEF2012. In the demonstration, we will present our system running on the PAN@CLEF2014 training data corpus.

Leilei Kong, Jie Li, Feng Zhao, Haoliang Qi, Zhongyuan Han, Yong Han, ZhiMao Lu

LRC Sousou: A Lyrics Retrieval System

Lyrics retrieval is one of the frequently used retrieval functions of search engines. However, existing lyrics retrieval systems neglect diversified information requirements. This paper introduces LRC Sousou, a lyrics retrieval system that corrects erroneous characters automatically, supports mixed queries of Chinese words and Pinyin, and handles English phoneme queries effectively. Natural language processing, information retrieval, and machine learning techniques are applied to the system, which enhances the practicability and efficiency of lyrics search and improves the user experience.

Yong Han, Li Min, Yu Zou, Zhongyuan Han, Song Li, Leilei Kong, Haoliang Qi, Wenhao Qiao, Shuo Cui, Hong Deng

Reduce the Shopping Distance: Map Region Search Based on High Order Voronoi Diagram

Many people use location-based services to find suitable stores for their daily purchases. Although many online map search engines return isolated Points-of-Interest as query results according to the correlation between individual stores and the query, this interaction has difficulty meeting the shopping needs of people with disabilities, who usually prefer shopping in a single location to avoid inconvenient travel. In this article, we propose a framework for a map search service that uses a Region-of-Interest (ROI) as the query result, which can greatly reduce users' shopping distance among multiple stores. A high-order Voronoi diagram is used to reduce the time complexity of Region-of-Interest generation. Experimental results show that our method is both efficient and effective.

Zhi Yu, Can Wang, Jiajun Bu, Mengni Zhang, Zejun Wu, Chun Chen

Applications of Bootstrap in Radar Signal Processing

The bootstrap technique is a powerful method for assessing the accuracy of parameter estimators and has been widely applied to statistical and signal processing problems. In this paper, a novel bootstrap-based program for DOA estimation is run and compared across different numbers of snapshots. We resample the received signals 200 to 1000 times to create new data, and the arrival angle is then estimated by the MUSIC algorithm under a confidence interval condition. The demo results show that, when the passive radar system receives few snapshots, a higher estimation probability and a smaller mean square error can be achieved than with the traditional algorithm.

Lei Liu, Dandan Fu, Yupu Zhu, Dan Su, Ming Diao
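
The resampling idea behind the bootstrap abstract above can be sketched with a percentile confidence interval. This is a generic illustration, not the authors' DOA program: the estimator here is the sample mean rather than MUSIC, and the measurements, seed, and 95% level are assumptions. The default of 1000 resamples sits at the upper end of the range the abstract mentions.

```python
import random
import statistics


def bootstrap_ci(samples, estimator=statistics.mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `estimator` over `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and re-estimate each time.
    estimates = sorted(
        estimator([samples[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical angle estimates (degrees) from a small number of snapshots.
angles = [29.1, 30.4, 30.0, 31.2, 29.7, 30.9, 28.8, 30.3]
print(bootstrap_ci(angles))
```

The width of the interval gives a rough sense of estimator accuracy when only a few observations are available, which is the setting the abstract targets.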

SAR Image Quality Assessment System Based on Human Visual Perception for Aircraft Electromagnetic Countermeasures

In electronic confrontation, Synthetic Aperture Radar (SAR) is vulnerable to different types of electronic jamming. The research on SAR jamming image quality assessment can provide the prerequisite for SAR jamming and anti-jamming technology, which is an urgent problem that researchers need to solve. Traditional SAR image quality assessment metrics analyze statistical error between the reference image and the jamming image only in the pixel domain; therefore, they cannot reflect the visual perceptual property of SAR jamming images effectively. In this demo, we develop a SAR image quality assessment system based on human visual perception for the application of aircraft electromagnetic countermeasures simulation platform. The internet of things and cloud computing techniques of big data are applied to our system. In the demonstration, we will present the assessment result interface of the SAR image quality assessment system.

Jiajing Wang, Dandan Fu, Tao Wang, Xiangming An

A Novel Multivariate Polynomial Approximation Factorization of Big Data

In engineering practice, processing big data sometimes requires building large numbers of physical models, and processing physical models requires corresponding mathematical models, which produces large numbers of multivariate polynomials whose effective reduction is currently a difficult problem. A novel algorithm is proposed to achieve the approximation factorization of multivariate polynomials with complex coefficients, in light of the characteristics of multivariate polynomials. First, the multivariate polynomial is reduced to a binary polynomial; then, the approximation factorization of the binary polynomial produces irreducible duality factors; finally, the irreducible duality factors are restored to irreducible multiple factors. As a unit root is cyclic, selecting the unit root as the reduced factor ensures that the coefficients do not expand during reduction. The Chinese remainder theorem is adopted in the corresponding reduction process, which lowers the computational complexity. The algorithm is based on the approximation factorization of binary polynomials and the calculation of an approximation Greatest Common Divisor (GCD). The algorithm can solve the reduction of multivariate polynomials in massive mathematical models and can effectively obtain the null points of multivariate polynomials, providing a new approach for further analysis and explanation of physical models. The experimental results show that the irreducible factors obtained by this method are close to the real factors, with high efficiency.

Guotao Luo, Guang Pei

Backmatter

Title: Intelligent Computation in Big Data Era
Editors: Hongzhi Wang, Haoliang Qi, Wanxiang Che, Zhaowen Qiu, Leilei Kong, Zhongyuan Han, Junyu Lin, Zeguang Lu
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-662-46248-5
Print ISBN: 978-3-662-46247-8
DOI: https://doi.org/10.1007/978-3-662-46248-5
